Abstract — Likelihood-based learning of graphical models faces
challenges of computational complexity and robustness to model
mis-specification. This paper studies methods that fit parameters directly
to maximize a measure of the accuracy of predicted marginals, taking 
into account both model and inference approximations at training time. 
Experiments on imaging problems suggest marginalization-based learn- 
ing performs better than likelihood-based approximations on difficult 
problems where the model being fit is approximate in nature. 



Index Terms — Graphical Models, Conditional Random Fields, Machine
Learning, Inference, Segmentation.



1 Introduction

Graphical models are a standard tool in image processing,
computer vision, and many other fields. Exact inference and
learning are often intractable, due to the high treewidth
of the graph.

Much previous work involves approximations of the
likelihood (Section 4). In this paper, we suggest that
parameter learning can instead be done using
"marginalization-based" loss functions. These directly
quantify the quality of the predictions of a given marginal
inference algorithm. This has two major advantages.
First, approximation errors in the inference algorithm are
taken into account while learning. Second, this is robust
to model mis-specification.

The contributions of this paper are as follows. First, we
present the general framework of marginalization-based
fitting as implicit differentiation. Second, we show that
the parameter gradient can be computed by "perturbation",
that is, by re-running the approximate algorithm twice with
the parameters perturbed slightly based on the current loss.
Third, we introduce the strategy of "truncated fitting".
Inference algorithms are based on optimization, where
one iterates updates until some convergence threshold is
reached. In truncated fitting, algorithms are derived to fit
the marginals produced after a fixed number of updates,
with no assumption of convergence. We show that this
leads to significant speedups. We also derive a variant of
this that can apply to likelihood-based learning. Finally,
experimental results confirm that marginalization-based
learning gives better results on difficult problems where
inference approximations and model mis-specification
are most significant.



2 Setup

2.1 Markov Random Fields

Markov random fields are probability distributions that
may be written as

$$p(x) = \frac{1}{Z} \prod_{c} \psi(x_c) \prod_{i} \psi(x_i). \tag{1}$$

This is defined with reference to a graph, with one node
for each random variable. The first product in Eq. 1 is
over the set of cliques c in the graph, while the second
is over all individual variables. For example, the graph

[Figure: example graph over the six variables x_1, ..., x_6]

corresponds to the distribution

$$p(x) = \frac{1}{Z} \Big[ \prod_{c} \psi(x_c) \Big] \times \psi(x_1)\psi(x_2)\psi(x_3)\psi(x_4)\psi(x_5)\psi(x_6),$$

where the first product runs over the cliques of the pictured graph.

Each function $\psi(x_c)$ or $\psi(x_i)$ is positive, but otherwise
arbitrary. The factor Z ensures normalization.

The motivation for these types of models is the
Hammersley-Clifford theorem [1], which gives specific conditions
under which a distribution can be written as in Eq.
1. Those conditions are that, first, each random variable
is conditionally independent of all others, given its immediate
neighbors and, secondly, that each configuration
x has nonzero probability. Often, domain knowledge
about conditional independence can be used to build 
a reasonable graph, and the factorized representation in 
an MRF reduces the curse of dimensionality encountered 
in modeling a high-dimensional distribution. 



2.2 Conditional Random Fields 

One is often interested in modeling the conditional
probability of x, given observations y. For such problems, it
is natural to define a Conditional Random Field [2]

$$p(x|y) = \frac{1}{Z(y)} \prod_{c} \psi(x_c, y) \prod_{i} \psi(x_i, y).$$

Here, $\psi(x_c, y)$ indicates that the value for a particular
configuration $x_c$ depends on the input y. In practice, the
form of this dependence is application dependent.

2.3 Inference Problems 

Suppose we have some distribution p(x|y), we are given
some input y, and we need to guess a single output
vector x*. What is the best guess?

The answer clearly depends on the meaning of "best".
One framework for answering this question is the idea
of a Bayes estimator [3]. One must specify some utility
function U(x, x'), quantifying how "happy" one is to
have guessed x if the true output is x'. One then chooses
x* to maximize the expected utility

$$x^* = \arg\max_{x} \sum_{x'} p(x'|y)\, U(x, x').$$

One natural utility function is an indicator function,
giving one for the exact value x' and zero otherwise. It
is easy to show that for this utility, the optimal estimate
is the popular Maximum a Posteriori (MAP) estimate.

Theorem. If $U(x, x') = I[x = x']$, then

$$x^* = \arg\max_{x} p(x|y).$$

Little can be said in general about whether this utility
function truly reflects user priorities. However, in high- 
dimensional applications, there are reasons for skepti- 
cism. First, the actual maximizing probability p(x*|y) 
in a MAP estimate might be extremely small, so much 
so that astronomical numbers of examples might be 
necessary before one could expect to exactly predict the 
true output. Second, this utility does not distinguish 
between a prediction that contains only a single error 
at some component $x_j$, and one that is entirely wrong.

An alternative utility function, popular for imaging
problems, quantifies the Hamming distance, or the number
of components of the output vector that are correct.
Maximizing this results in selecting the most likely value
for each component independently.

Theorem. If $U(x, x') = \sum_i I[x_i = x'_i]$, then

$$x^*_i = \arg\max_{x_i} p(x_i|y). \tag{2}$$

This appears to have been originally called Maximum
Posterior Marginal (MPM) inference [4], though it has
been reinvented under other names [5]. From a computational
perspective, the main difficulty is not performing
the trivial maximization in Eq. 2, but rather computing
the marginals $p(x_i|y)$. The marginal-based loss functions
introduced in Section 4.2 can be motivated by the idea
that at test time, one will use an inference method similar
to MPM where one is concerned only with the accuracy
of the marginals.

The results of MAP and MPM inference will be similar 
if the distribution p(x|y) is heavily "peaked" at a single 
configuration x. Roughly, the greater the entropy of 
p(x|y), the more there is to be gained in integrating 



over all possible configurations, as MPM does. A few
papers have experimentally compared MAP and MPM
inference [6], [7].
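
The difference between the two estimators is easy to see on a small example. The following sketch is illustrative only: the table `p` is a made-up conditional distribution over three binary variables, and all function names are hypothetical. It computes the MAP estimate, the MPM estimate of Eq. 2, and the expected Hamming utility of a guess by brute-force enumeration.

```python
import itertools
import numpy as np

# Hypothetical p(x|y) over three binary variables, stored as a 2x2x2 table.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = 0.4
p[0, 1, 1] = 0.3
p[1, 0, 1] = 0.3

def map_estimate(p):
    """Return the single most probable joint configuration."""
    return tuple(int(v) for v in np.unravel_index(np.argmax(p), p.shape))

def mpm_estimate(p):
    """Return the maximizer of each univariate marginal, component by component."""
    guess = []
    for i in range(p.ndim):
        other_axes = tuple(a for a in range(p.ndim) if a != i)
        marginal = p.sum(axis=other_axes)
        guess.append(int(np.argmax(marginal)))
    return tuple(guess)

def expected_hamming_utility(p, guess):
    """Expected number of components of `guess` that match the true output."""
    total = 0.0
    for x in itertools.product(*(range(n) for n in p.shape)):
        total += p[x] * sum(int(g == t) for g, t in zip(guess, x))
    return total
```

With this table the MAP estimate is (0, 0, 0), while the MPM estimate is (0, 0, 1). The MPM guess has zero probability as a joint configuration, yet it matches more components on average (2.0 versus 1.8).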

2.4 Exponential Family 

The exponential family is defined by

$$p(x; \theta) = \exp\big(\theta \cdot f(x) - A(\theta)\big),$$

where $\theta$ is a vector of parameters, f(x) is a vector of
sufficient statistics, and the log-partition function

$$A(\theta) = \log \sum_{x} \exp \theta \cdot f(x) \tag{3}$$

ensures normalization. Different sufficient statistics f(x)
define different distributions. The exponential family is 
well understood in statistics. Accordingly, it is useful to 
note that a Markov random field (Eq. 1) is a member of
the exponential family, with sufficient statistics consisting
of indicator functions for each possible configuration
of each clique and each variable [8], namely,

$$f(X) = \{ I[X_c = x_c] \mid \forall c, x_c \} \cup \{ I[X_i = x_i] \mid \forall i, x_i \}.$$

It is useful to introduce the notation $\theta(x_c)$ to refer
to the component of $\theta$ corresponding to the indicator
function $I[X_c = x_c]$, and similarly for $\theta(x_i)$. Then,
the MRF in Eq. 1 would have $\psi(x_c) = e^{\theta(x_c)}$ and
$\psi(x_i) = e^{\theta(x_i)}$. Many operations on graphical models
can be more elegantly represented using this exponential 
family representation. 

A standard problem in the exponential family is to
compute the mean value of f,

$$\mu(\theta) = \sum_{x} p(x; \theta) f(x),$$

called the "mean parameters". It is easy to show these
are equal to the gradient of the log-partition function,

$$\frac{dA}{d\theta} = \mu(\theta). \tag{4}$$

For an exponential family corresponding to an MRF,
computing $\mu$ is equivalent to computing all the marginal
probabilities. To see this, note that, using a similar
notation for indexing $\mu$ as for $\theta$ above,

$$\mu(x_c; \theta) = \sum_{X} p(X; \theta)\, I[X_c = x_c] = p(x_c; \theta).$$
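
The identity between mean parameters and the gradient of the log-partition function can be checked numerically on a model small enough to enumerate. The sketch below is an illustration under stated assumptions: the feature map `features` (three unary terms and two chain interactions on binary variables) is a hypothetical choice, not the indicator statistics used above, and all names are made up for the example.

```python
import itertools
import numpy as np

def features(x):
    """Hypothetical sufficient statistics for three binary variables on a chain."""
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 * x2, x2 * x3], dtype=float)

STATES = list(itertools.product([0, 1], repeat=3))
F = np.array([features(x) for x in STATES])  # one row f(x) per configuration

def log_partition(theta):
    """A(theta) = log sum_x exp(theta . f(x)), by brute-force enumeration."""
    scores = F @ theta
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

def mean_parameters(theta):
    """mu(theta) = sum_x p(x; theta) f(x)."""
    p = np.exp(F @ theta - log_partition(theta))
    return p @ F

def numerical_gradient(fun, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (fun(theta + step) - fun(theta - step)) / (2 * eps)
    return grad
```

For any parameter vector, `mean_parameters(theta)` and `numerical_gradient(log_partition, theta)` should agree up to finite-difference error.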

Conditional distributions can be represented by thinking
of the parameter vector $\theta(y; \gamma)$ as being a function of
the input y, where $\gamma$ are now the free parameters rather
than $\theta$. (Again, the nature of the dependence of $\theta$ on y
and $\gamma$ will vary by application.) Then, we have that

$$p(x|y; \gamma) = \exp\big(\theta(y; \gamma) \cdot f(x) - A(\theta(y; \gamma))\big), \tag{5}$$

sometimes called a curved conditional exponential family.

2.5 Learning

The focus of this paper is learning of model parameters 
from data. (Automatically determining graph structure 
remains an active research area, but is not considered 
here.) Specifically, we take the goal of learning to be to 
minimize the empirical risk 

$$R(\theta) = \sum_{x} L(\theta, x), \tag{6}$$

where the summation is over all examples x in the
dataset, and the loss function $L(\theta, x)$ quantifies how
well the distribution defined by the parameter vector
$\theta$ matches the example x. Several loss functions are
considered in Section 4.

We assume that the empirical risk will be fit by some 
gradient-based optimization. Hence, the main technical 
issues in learning are which loss function to use and how 
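to compute the gradient $dL/d\theta$.

In practice, we will usually be interested in fitting
conditional distributions. Using the notation from Eq.
5, we can write this as

$$R(\gamma) = \sum_{(y,x)} L(\theta(y, \gamma), x).$$

Note that if one has recovered $dL/d\theta$, then $dL/d\gamma$ is
immediate from the vector chain rule as

$$\frac{dL}{d\gamma} = \frac{d\theta^T}{d\gamma} \frac{dL}{d\theta}. \tag{7}$$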

Thus, the main technical problems involved in fitting
a conditional distribution are similar to those for a
generative distribution: One finds $\theta = \theta(y, \gamma)$, computes
L and $dL/d\theta$ on example x exactly as in the generative
case, and finally recovers $dL/d\gamma$ from Eq. 7. So, for simplicity,
y and $\gamma$ will largely be ignored in the theoretical
developments below.
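
As a concrete instance of this chain rule, suppose (purely as an assumption for illustration) that the parameters depend linearly on the input, $\theta(y; \gamma) = W y$ with $\gamma = W$. The gradient with respect to W is then the outer product of $dL/d\theta$ and y. The sketch below uses a stand-in loss with a known gradient; every name in it is hypothetical.

```python
import numpy as np

def theta_of(W, y):
    """Hypothetical parametrization theta(y; gamma) = W y, with gamma = W."""
    return W @ y

def grad_wrt_W(dL_dtheta, y):
    """Chain rule for the linear map above: dL/dW[k, m] = dL/dtheta[k] * y[m]."""
    return np.outer(dL_dtheta, y)

def toy_loss(theta):
    """Stand-in loss with a known gradient, used only to exercise the chain rule."""
    return 0.5 * float(theta @ theta)

def toy_loss_grad(theta):
    """Gradient of toy_loss with respect to theta."""
    return theta
```

In a real application, `toy_loss_grad` would be replaced by whichever routine returns $dL/d\theta$ for the chosen loss, and the rest of the computation is unchanged.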

3 Variational Inference 

This section reviews approximate methods for comput- 
ing marginals, with notation based on Wainwright and 
Jordan [8]. For readability, all proofs in this section are
postponed to Appendix A. 

The relationship between the marginals and the
log-partition function in Eq. 4 is key to defining approximate
marginalization procedures. In Section 3.1, the
exact variational principle shows that the (intractable)
problem of computing the log-partition function can be
converted to a (still intractable) optimization problem. To
derive a tractable marginalization algorithm, one approximates
this optimization, yielding some approximate
log-partition function $\tilde{A}(\theta)$. The approximate marginals
are then taken as the exact gradient of $\tilde{A}$.

We define the reverse mapping $\theta(\mu)$ to return some
parameter vector that yields the marginals $\mu$. While this
will in general not be unique [8, sec. 3.5.2], any two
vectors that produce the same marginals $\mu$ will also yield
the same distribution, and so $p(x; \theta(\mu))$ is unambiguous.



3.1 Exact Variational Principle 

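Theorem (Exact variational principle). The log-partition
function can also be represented as

$$A(\theta) = \max_{\mu \in \mathcal{M}} \theta \cdot \mu + H(\mu), \tag{8}$$

where

$$\mathcal{M} = \{ \mu' : \exists \theta, \mu' = \mu(\theta) \}$$

is the marginal polytope, and

$$H(\mu) = -\sum_{x} p(x; \theta(\mu)) \log p(x; \theta(\mu))$$

is the entropy.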

In treelike graphs, this optimization can be solved 
efficiently. In general graphs, however, it is intractable in 
two ways. First, the marginal polytope M becomes dif- 
ficult to characterize. Second, the entropy is intractable 
to compute. 

Applying Danskin's theorem to Eq. 8 yields that

$$\mu(\theta) = \frac{dA}{d\theta} = \arg\max_{\mu \in \mathcal{M}} \theta \cdot \mu + H(\mu). \tag{9}$$

Thus, the partition function (Eq. 8) and marginals (Eq.
9) can both be obtained from solving the same optimization
problem. This close relationship between the
log-partition function and marginals is heavily used in the
derivation of approximate marginalization algorithms.
To compute approximate marginals, first, derive an approximate
version of the optimization in Eq. 8. Next,
take the exact gradient of this approximate partition 
function. This strategy is used in both of the approximate 
marginalization procedures considered here: mean field 
and tree-reweighted belief propagation. 

3.2 Mean Field 

The idea of mean field is to approximate the exact
variational principle by replacing $\mathcal{M}$ with some tractable
subset $\mathcal{F} \subset \mathcal{M}$, such that $\mathcal{F}$ is easy to characterize,
and for any vector $\mu \in \mathcal{F}$ we can exactly compute the
entropy. To create such a set $\mathcal{F}$, instead of considering
the set of mean vectors obtainable from any parameter
vector (which characterizes $\mathcal{M}$), consider a subset of
tractable parameter vectors. The simplest way to achieve
this is to restrict consideration to parameter vectors $\theta$ with
$\theta(x_c) = 0$ for all factors c,

$$\mathcal{F} = \{ \mu' : \exists \theta, \mu' = \mu(\theta), \forall c, \theta(x_c) = 0 \}.$$

It is not hard to see that this corresponds to the set
of fully-factorized distributions. Note also that this is (in
non-treelike graphs) a non-convex set, since it has the
same convex hull as $\mathcal{M}$, but is a proper subset. So, the
mean field partition function approximation is based on
the optimization

$$\tilde{A}(\theta) = \max_{\mu \in \mathcal{F}} \theta \cdot \mu + H(\mu), \tag{10}$$

with approximate marginals corresponding to the maximizing
vector $\mu$, i.e.

$$\tilde{\mu}(\theta) = \arg\max_{\mu \in \mathcal{F}} \theta \cdot \mu + H(\mu). \tag{11}$$



Since this is maximizing the same objective as the
exact variational principle, but under a more restricted
constraint set, clearly $\tilde{A}(\theta) \le A(\theta)$.

Here, since the marginals are coming from a
fully-factorized distribution, the exact entropy is available as

$$H(\mu) = -\sum_{i} \sum_{x_i} \mu(x_i) \log \mu(x_i). \tag{12}$$



The strategy we use to perform the maximization in
Eq. 10 is block-coordinate ascent. Namely, we pick a
coordinate j, then set $\mu(x_j)$ to maximize the objective,
leaving $\mu(x_i)$ fixed for all $i \neq j$. The next theorem
formalizes this.

Theorem (Mean Field Updates). A local maximum of Eq.
10 can be reached by iterating the updates

$$\mu(x_j) \leftarrow \frac{1}{Z} \exp\Big( \theta(x_j) + \sum_{c : j \in c} \sum_{x_{c \setminus j}} \theta(x_c) \prod_{i \in c \setminus j} \mu(x_i) \Big),$$

where Z is a normalizing factor ensuring that $\sum_{x_j} \mu(x_j) = 1$.
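
For pairwise models the coordinate update takes a particularly simple form. The following is a minimal sketch, not the implementation used in the experiments: it assumes a pairwise model whose parameters are stored in a hypothetical layout (`theta_i` as an array of univariate terms and `theta_ij` as a dictionary of edge tables), and it includes a brute-force log-partition function so that the mean field lower bound can be checked on small graphs.

```python
import itertools
import numpy as np

def mean_field(theta_i, theta_ij, edges, n_sweeps=50):
    """Coordinate-ascent mean field for a pairwise model.

    theta_i:  (n, K) array of univariate parameters theta(x_i)
    theta_ij: dict mapping each edge (i, j) to a (K, K) array theta(x_i, x_j)
    """
    n, K = theta_i.shape
    mu = np.full((n, K), 1.0 / K)
    for _ in range(n_sweeps):
        for j in range(n):
            s = theta_i[j].copy()
            for (a, b) in edges:
                if a == j:
                    s += theta_ij[(a, b)] @ mu[b]      # expectation over x_b
                elif b == j:
                    s += theta_ij[(a, b)].T @ mu[a]    # expectation over x_a
            s -= s.max()                               # numerical stability
            mu[j] = np.exp(s) / np.exp(s).sum()
    return mu

def mean_field_objective(theta_i, theta_ij, edges, mu):
    """theta . mu + H(mu), evaluated at a fully factorized mu."""
    value = (theta_i * mu).sum()
    for (a, b) in edges:
        value += mu[a] @ theta_ij[(a, b)] @ mu[b]
    value -= (mu * np.log(mu)).sum()
    return float(value)

def exact_log_partition(theta_i, theta_ij, edges):
    """Brute-force A(theta); only feasible for very small models."""
    n, K = theta_i.shape
    scores = []
    for x in itertools.product(range(K), repeat=n):
        s = sum(theta_i[i, x[i]] for i in range(n))
        s += sum(theta_ij[(a, b)][x[a], x[b]] for (a, b) in edges)
        scores.append(s)
    scores = np.array(scores)
    return float(scores.max() + np.log(np.exp(scores - scores.max()).sum()))
```

On any small graph, the value returned by `mean_field_objective` at the converged marginals should not exceed `exact_log_partition`, reflecting the lower-bound property of mean field.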



3.3 Tree-Reweighted Belief Propagation 

Whereas mean field replaced the marginal polytope
with a subset, tree-reweighted belief propagation (TRW)
replaces it with a superset, $\mathcal{L} \supset \mathcal{M}$. This clearly can
only increase the value of the approximate log-partition
function. However, a further approximation is needed,
as the entropy remains intractable to compute for an
arbitrary mean vector $\mu$. (It is not even defined for
$\mu \notin \mathcal{M}$.) Thus, TRW further approximates the entropy
with a tractable upper bound. Taken together, these two
approximations yield a tractable upper bound on the
log-partition function.

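Thus, TRW is based on the optimization problem

$$\tilde{A}(\theta) = \max_{\mu \in \mathcal{L}} \theta \cdot \mu + \tilde{H}(\mu). \tag{13}$$

Again, the approximate marginals are simply the maximizing
vector $\mu$, i.e.,

$$\tilde{\mu}(\theta) = \arg\max_{\mu \in \mathcal{L}} \theta \cdot \mu + \tilde{H}(\mu). \tag{14}$$

The relaxation of the marginal polytope used in TRW is
the local polytope,

$$\mathcal{L} = \Big\{ \mu : \sum_{x_{c \setminus i}} \mu(x_c) = \mu(x_i), \ \sum_{x_i} \mu(x_i) = 1 \Big\}. \tag{15}$$

Since any valid marginal vector must obey these constraints,
clearly $\mathcal{M} \subset \mathcal{L}$. However, $\mathcal{L}$ in general also
contains unrealizable vectors (though on trees $\mathcal{L} = \mathcal{M}$).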



Thus, the marginal vector returned by TRW may, in gen- 
eral, be inconsistent in the sense that no joint distribution 
yields those marginals. 

The entropy approximation used by TRW is

$$\tilde{H}(\mu) = \sum_{i} H(\mu_i) - \sum_{c} \rho_c I(\mu_c), \tag{16}$$

where $H(\mu_i) = -\sum_{x_i} \mu(x_i) \log \mu(x_i)$ is the univariate
entropy corresponding to variable i, and

$$I(\mu_c) = \sum_{x_c} \mu(x_c) \log \frac{\mu(x_c)}{\prod_{i \in c} \mu(x_i)} \tag{17}$$

is the mutual information corresponding to the variables
in the factor c. The motivation for this approximation is
that if the constants $\rho_c$ are selected appropriately, this
gives an upper bound on the true entropy.

Theorem (TRW Entropy Bound). Let $\Pr(\mathcal{G})$ be a distribution
over tree-structured graphs, and define $\rho_c = \Pr(c \in \mathcal{G})$.
Then, with $\tilde{H}$ as defined in Eq. 16,

$$\tilde{H}(\mu) \ge H(\mu).$$

Thus, TRW is maximizing an upper bound on the exact
variational principle, under an expanded constraint
set. Since both of these changes can only increase the
maximum value, we have that $\tilde{A}(\theta) \ge A(\theta)$.
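
The entropy bound can be checked numerically on distributions small enough to enumerate. The sketch below is an illustration with hypothetical function names: it evaluates the TRW entropy approximation at the exact marginals of a joint table with pairwise cliques. For a triangle graph, taking the uniform distribution over its three spanning trees gives an edge appearance probability of 2/3 for every edge; for a chain, which is itself a tree, all edge appearance probabilities equal one and the approximation coincides with the true entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def pair_marginal(joint, i, j):
    """Marginal table over variables i < j of an n-dimensional joint table."""
    other = tuple(a for a in range(joint.ndim) if a not in (i, j))
    return joint.sum(axis=other)

def trw_entropy(joint, edges, rho):
    """sum_i H(mu_i) - sum_c rho_c I(mu_c), using the exact marginals of `joint`."""
    n = joint.ndim
    uni = [joint.sum(axis=tuple(a for a in range(n) if a != i)) for i in range(n)]
    value = sum(entropy(m) for m in uni)
    for (i, j), r in zip(edges, rho):
        # For a pair, mutual information equals H(mu_i) + H(mu_j) - H(mu_ij).
        mutual_info = entropy(uni[i]) + entropy(uni[j]) - entropy(pair_marginal(joint, i, j))
        value -= r * mutual_info
    return value
```

Calling `trw_entropy(joint, [(0, 1), (1, 2), (0, 2)], [2/3, 2/3, 2/3])` on any joint table over three variables should return a value no smaller than `entropy(joint)`.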

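Now, we consider how to actually compute the
approximate log-partition function and associated
marginals. Consider the message-passing updates

$$m_c(x_i) \propto \sum_{x_{c \setminus i}} e^{\frac{1}{\rho_c} \theta(x_c)} \prod_{j \in c \setminus i} e^{\theta(x_j)} \frac{\prod_{d : j \in d} m_d(x_j)^{\rho_d}}{m_c(x_j)}, \tag{18}$$

where "$\propto$" is used as an assignment operator to mean
assigning after normalization.

Theorem (TRW Updates). Let $\rho_c$ be as in the previous
theorem. Then, if the updates in Eq. 18 reach a fixed point,
the marginals defined by

$$\mu(x_c) \propto e^{\frac{1}{\rho_c} \theta(x_c)} \prod_{i \in c} e^{\theta(x_i)} \frac{\prod_{d : i \in d} m_d(x_i)^{\rho_d}}{m_c(x_i)},$$

$$\mu(x_i) \propto e^{\theta(x_i)} \prod_{d : i \in d} m_d(x_i)^{\rho_d}$$

constitute the global optimum of Eq. 13.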



So, if the updates happen to converge, we have the 
solution. Meltzer et al. show [9] that on certain graphs
made up of monotonic chains, an appropriate ordering of 
messages does assure convergence. (The proof is essen- 
tially that under these circumstances, message passing 
is equivalent to coordinate ascent in the dual.) 

TRW simplifies into loopy belief propagation by
choosing $\rho_c = 1$ everywhere, though the bounding
property is lost.

4 Loss Functions

For space, only a representative sample of prior work 
can be cited. A recent review [10] is more thorough.

Though, technically, a "loss" should be minimized, we
continue to use this terminology for the likelihood and
its approximations, where one wishes to maximize.

For simplicity, the discussion below is for the generative
setting. Using the same loss functions for training a
conditional model is simple (Section 2.5).

4.1 The Likelihood and Approximations 

The classic loss function would be the likelihood, with 

$$L(\theta, x) = \log p(x; \theta) = \theta \cdot f(x) - A(\theta). \tag{19}$$

This has the gradient

$$\frac{dL}{d\theta} = f(x) - \mu(\theta). \tag{20}$$

One argument for the likelihood is that it is efficient;
given a correct model, as data increases it converges to
true parameters at an asymptotically optimal rate [11].

Some previous work uses tree-structured graphs
where marginals may be computed exactly [12]. Of
course, in high-treewidth graphs, the likelihood and
its gradient will be intractable to compute exactly, due
to the presence of the log-partition function $A(\theta)$ and
marginals $\mu(\theta)$. This has motivated a variety of approximations.
The first is to approximate the marginals $\mu$
using Markov chain Monte Carlo [13], [14]. This can
lead to high computational expense (particularly in the
conditional case, where different chains must be run for
each input). Contrastive Divergence [15] further approximates
these samples by running the Markov chain for
only a few steps, but started at the data points [16]. If
the Markov chain is run long enough, these approaches 
can give an arbitrarily good approximation. However, 
Markov chain parameters may need to be adjusted to the 
particular problem, and these approaches are generally 
slower than those discussed below. 

4.1.1 Surrogate Likelihood 

A seemingly heuristic approach would be to replace the
marginals in Eq. 20 with those from an approximate
inference method. This approximation can be quite principled
if one thinks instead of approximating the
log-partition function in the likelihood itself (Eq. 19). Then,
the corresponding approximate marginals will emerge as
the exact gradient of this surrogate loss. This "surrogate
likelihood" [17] approximation appears to be the most
widely used loss in imaging problems, with marginals
approximated by either mean field [18], [19], TRW [20],
or LBP [21], [22], [23], [24], [25]. However, the terminology
of "surrogate likelihood" is not widespread and in
most cases, only the gradient is computed, meaning the 
optimization cannot use line searches. 

If one uses a log-partition approximation that pro- 
vides a bound on the true log-partition function, the 



surrogate likelihood will then bound the true likelihood. 
Specifically, mean field based surrogate likelihood is an 
upper bound on the true likelihood, while TRW-based 
surrogate likelihood is a lower bound. 

4.1.2 Expectation Maximization

In many applications, only a subset of variables may be
observed. Suppose that we want to model x = (z, h)
where z is observed, but h is hidden. A natural loss
function here is the expectation maximization (EM) loss

$$L(\theta, z) = \log p(z; \theta) = \log \sum_{h} p(z, h; \theta).$$

It is easy to show that this is equivalent to

$$L(\theta, z) = A(\theta, z) - A(\theta), \tag{21}$$

where $A(\theta, z) = \log \sum_{h} \exp \theta \cdot f(z, h)$ is the log-partition
function with z "clamped" to the observed values. If all
variables are observed, $A(\theta, z)$ reduces to $\theta \cdot f(z)$.

If one substitutes a variational approximation for
$A(\theta, z)$, a "variational EM" algorithm [8, Sec. 6.2.2] can
be recovered that alternates between computing approximate
marginals and parameter updates. Here, because
of the close relationship to the surrogate likelihood, we
designate "surrogate EM" for the case where $A(\theta, z)$ and
$A(\theta)$ may both be approximated and the learning is done
with a gradient-based method. To obtain a bound on
the true EM loss, care is required. For example,
lower-bounding $A(\theta, z)$ using mean field, and upper-bounding
$A(\theta)$ using TRW means a lower-bound on the true EM
loss. However, using the same approximation for both
$A(\theta)$ and $A(\theta, z)$ appears to work well in practice [26].

4.1.3 Saddle-Point Approximation 

A third approximation of the likelihood is to search for a
"saddle-point". Here, one approximates the gradient in
Eq. 20 by running a (presumably approximate) MAP inference
algorithm, and then imagining that the marginals
put unit probability at the approximate MAP solution,
and zero elsewhere [27], [28], [21]. This is a heuristic
method, but it can be expected to work well when the 
estimated MAP solution is close to the true MAP and 
the conditional distribution p(x|y) is strongly "peaked". 

4.1.4 Pseudolikelihood 

Finally, there are two classes of likelihood approximations
that do not require inference. The first is the classic
pseudolikelihood [29], where one uses

$$L(\theta, x) = \sum_{i} \log p(x_i | x_{-i}; \theta).$$

This can be computed efficiently, even in high 
treewidth graphs, since conditional probabilities are easy 
to compute. Besag [29] showed that, under certain conditions,
this will converge to the true parameter vector
as the amount of data becomes infinite. The pseudolikelihood
has been used in many applications [30], [31].

Instead of the probability of individual variables given
all others, one can take the probability of patches of
variables given all others, sometimes called the "patch"
pseudolikelihood [32]. This interpolates to the exact
likelihood as the patches become larger, though some 
type of inference is generally required. 
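
To make concrete why the single-variable pseudolikelihood needs no inference, the following sketch evaluates it for a pairwise model. It is illustrative only: the parameter layout (`theta_i` as an array of univariate terms, `theta_ij` as a dictionary of edge tables) and the function name are assumptions made for this example. Each conditional $p(x_i | x_{-i}; \theta)$ only requires normalizing over the values of one variable.

```python
import numpy as np

def log_pseudolikelihood(theta_i, theta_ij, edges, x):
    """sum_i log p(x_i | x_{-i}; theta) for a pairwise model.

    theta_i:  (n, K) array of univariate parameters
    theta_ij: dict mapping each edge (i, j) to a (K, K) array
    x:        sequence of n integer states
    """
    n, K = theta_i.shape
    total = 0.0
    for i in range(n):
        s = theta_i[i].copy()                 # score of each candidate value of x_i
        for (a, b) in edges:
            if a == i:
                s += theta_ij[(a, b)][:, x[b]]
            elif b == i:
                s += theta_ij[(a, b)][x[a], :]
        m = s.max()
        log_norm = m + np.log(np.exp(s - m).sum())   # local normalizer only
        total += s[x[i]] - log_norm
    return float(total)
```

The cost is linear in the number of variables and edges, regardless of the treewidth of the graph.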

4.1.5 Piecewise Likelihood 

More recently, Sutton and McCallum [33] suggested the
piecewise likelihood. The idea is to approximate the log- 
partition function as a sum of log-partition functions of 
the different "pieces" of the graph. There is flexibility in 
determining which pieces to use. In this paper, we will 
use pieces consisting of each clique and each variable, 
which worked better in practice than some alternatives. 
Then, one has the surrogate partition function

$$\tilde{A}(\theta) = \sum_{c} A_c(\theta) + \sum_{i} A_i(\theta),$$

$$A_c(\theta) = \log \sum_{x_c} e^{\theta(x_c)}, \qquad A_i(\theta) = \log \sum_{x_i} e^{\theta(x_i)}.$$

It is not too hard to show that $A(\theta) \le \tilde{A}(\theta)$. In practice,
it is sometimes best to make some heuristic adjustments
to the parameters after learning to improve test-time
performance [34], [35].
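
The piecewise bound on the log-partition function can be verified by enumeration on a small model. This is a sketch under the same hypothetical parameter layout as a generic pairwise model (`theta_i` for variables, `theta_ij` for edges); the function names are illustrative.

```python
import itertools
import numpy as np

def logsumexp(v):
    """Numerically stable log of the sum of exponentials of all entries."""
    v = np.asarray(v, dtype=float).ravel()
    m = v.max()
    return float(m + np.log(np.exp(v - m).sum()))

def exact_log_partition(theta_i, theta_ij, edges):
    """Brute-force A(theta); only feasible for very small models."""
    n, K = theta_i.shape
    scores = []
    for x in itertools.product(range(K), repeat=n):
        s = sum(theta_i[i, x[i]] for i in range(n))
        s += sum(theta_ij[(a, b)][x[a], x[b]] for (a, b) in edges)
        scores.append(s)
    return logsumexp(scores)

def piecewise_log_partition(theta_i, theta_ij, edges):
    """Sum of the log-partition functions of every variable and clique piece."""
    value = sum(logsumexp(theta_i[i]) for i in range(theta_i.shape[0]))
    value += sum(logsumexp(theta_ij[e]) for e in edges)
    return value
```

The bound holds because expanding the product of the per-piece sums yields every jointly consistent configuration plus additional positive terms.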

4.2 Marginal-based Loss Functions 

Given the discussion in Section 4.1, one might conclude
that the likelihood, while difficult to optimize, is an 
ideal loss function since, given a well-specified model, 
it will converge to the true parameters at asymptotically 
efficient rates. However, this conclusion is complicated 
by two issues. First, of course, the maximum likelihood 
solution is computationally intractable, motivating the 
approximations above. 

A second issue is that of model mis-specification. For 
many types of complex phenomena, we will wish to fit 
a model that is approximate in nature. This could be true 
because the conditional independencies asserted by the 
graph do not exactly hold, or because the parametriza- 
tion of factors is too simplistic. These approximations 
might be made out of ignorance, due to a lack of knowl- 
edge about the domain being studied, or deliberately 
because the true model might have too many degrees of 
freedom to be fit with available data. 

In the case of an approximate model, no "true" param- 
eters exist. The idea of marginal-based loss functions is 
to instead consider how the model will be used. If one 
will compute marginals at test-time, perhaps for MPM
inference (Section 2.3), it makes sense to maximize the
accuracy of these predictions. Further, if one will use 
an approximate inference algorithm, it makes sense to 
optimize the accuracy of the approximate marginals. This 
essentially fits into the paradigm of empirical risk min- 
imization [36], [37]. The idea of training a probabilistic 
model using an alternative loss to the likelihood goes 
back at least to Bahl et al. in the late 1980s [38].



There is reason to think the likelihood is somewhat 
robust to model mis-specification. In the infinite data 
limit, it finds the "closest" solution in the sense of KL- 
divergence since, if q is the true distribution, then 

$$\mathrm{KL}(q \| p) = \text{const.} - \mathbb{E}_q \log p(x; \theta).$$

4.2.1 Univariate Logistic Loss

The univariate logistic loss [39] is defined by

$$L(\theta, x) = -\sum_{i} \log \mu(x_i; \theta),$$

where we use the notation $\mu$ to indicate that the loss is
implicitly defined with respect to the marginal predictions
of some (possibly approximate) algorithm, rather
than the true marginals. This measures the mean accuracy
of all univariate marginals, rather than the joint
distribution. This loss can be seen as empirical risk
minimization of the KL-divergence between the true
marginals and the predicted ones, since

$$\sum_{i} \mathrm{KL}\big(q(x_i) \,\|\, \mu(x_i; \theta)\big) = \sum_{i} \sum_{x_i} q(x_i) \log \frac{q(x_i)}{\mu(x_i; \theta)} = \text{const.} - \mathbb{E}_q \sum_{i} \log \mu(x_i; \theta).$$

If defined on exact marginals, this is a type of composite
likelihood [40].

4.2.2 Smoothed Univariate Classification Error

Perhaps the most natural loss in the conditional setting
would be the univariate classification error,

$$L(\theta, x) = \sum_{i} S\Big( \max_{x'_i \neq x_i} \mu(x'_i; \theta) - \mu(x_i; \theta) \Big),$$

where S is the step function. This exactly measures the
number of components of x that would be incorrectly
predicted if using MPM inference. Of course, this loss
is neither differentiable nor continuous, which makes it
impractical to optimize using gradient-based methods.
Instead, Gross et al. [5] suggest approximating with a
sigmoid function $S(t) = (1 + \exp(-\alpha t))^{-1}$, where $\alpha$
controls approximation quality.
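
Both univariate losses are straightforward to evaluate once predicted marginals are available. The sketch below is illustrative: it assumes the marginals are stored as an array `mu` with one row per variable and one column per state, and the function names and the default value of `alpha` are arbitrary choices made for the example.

```python
import numpy as np

def univariate_logistic_loss(mu, x):
    """-sum_i log mu(x_i), with mu an (n, K) array of predicted marginals."""
    idx = np.arange(len(x))
    return float(-np.log(mu[idx, x]).sum())

def smoothed_classification_error(mu, x, alpha=10.0):
    """Sigmoid-smoothed count of components that MPM inference would get wrong."""
    idx = np.arange(len(x))
    correct = mu[idx, x]
    masked = mu.copy()
    masked[idx, x] = -np.inf            # exclude the true state from the max
    best_wrong = masked.max(axis=1)
    t = best_wrong - correct            # positive exactly when MPM would err
    return float((1.0 / (1.0 + np.exp(-alpha * t))).sum())
```

As `alpha` grows, the smoothed error approaches the exact number of misclassified components, at the price of a loss surface with steeper transitions.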

There is evidence [36], [5] that the smoothed classification
loss can yield parameters with lower univariate
classification error under MPM inference. However, our 
experience is that it is also more prone to getting stuck in 
local minima, making experiments difficult to interpret. 
Thus, it is not included in the experiments below. Our 
experience with the univariate quadratic loss [41] is 
similar. 

4.2.3 Clique Losses 

Any of the above univariate losses can be instead taken 
based on cliques. For example, the clique logistic loss is 

$$L(\theta, x) = -\sum_{c} \log \mu(x_c; \theta),$$

which may be seen as empirical risk minimization of
the mean KL-divergence of the true clique marginals to
the predicted ones. An advantage of this with an exact
model is consistency. Simple examples show cases where
a model predicts perfect univariate marginals, despite
the joint distribution being very inaccurate. However, if
all clique marginals are correct, the joint must be correct,
by the standard moment matching conditions for the
exponential family [8].

Figure 1: Mean test error of various loss functions
(likelihood, clique logistic, and univariate logistic)
trained with exact inference. In the case of a well-specified
model (shift of zero), the likelihood performs
essentially identically to the marginal-based loss functions.
However, when mis-specification is introduced,
quite different estimates result.

4.2.4 Hidden variables 

Marginal-based loss functions can accommodate hidden 
variables by simply taking the sum in the loss over the 
observed variables only. A similar approach can be used 
with the pseudolikelihood or piecewise likelihood. 

4.3 Comparison with Exact Inference 

To compare the effects of different loss functions in the 
presence of model mis-specification, this section contains 
a simple example where the graphical model takes the 
following "chain" structure:




Here, exact inference is possible, so comparison is not 
complicated by approximate inference. 

All variables are binary. Parameters are generated by
taking $\theta(x_i)$ randomly from the interval $[-1, +1]$ for all i
and $x_i$. Interaction parameters are taken as $\theta(x_i, x_j) = t$
when $x_i = x_j$, and $\theta(x_i, x_j) = -t$ when $x_i \neq x_j$, where
t is randomly chosen from the interval $[-1, +1]$ for all
(i, j). Interactions $\theta(y_i, y_j)$ and $\theta(x_i, y_i)$ are chosen in the
same way.

To systematically study the effects of differing
"amounts" of mis-specification, after generating data,
we apply various circular shifts to x. Thus, the data
no longer corresponds exactly to the structure of the
graphical model being fit.

Thirty-two different random distributions were cre- 
ated. For each, various quantities of data were generated 



by Markov chain Monte Carlo, with shifts introduced
after sampling. The likelihood was fit using the closed-form
gradient (Sec. 4.1), while the logistic losses were
trained using a gradient obtained via backpropagation
(Sec. 7). Fig. 1 shows the mean test error (estimated on
1000 examples), while Fig. 2 shows example marginals.
We see that the performance of all methods deteriorates 
with mis-specification, but the marginal-based loss func- 
tions are more resistant to these effects. 

4.4 MAP-Based Training 

Another class of methods explicitly optimizes the performance
of MAP inference [42], [43], [44], [45], [25]. This
paper focuses on applications that use marginal infer- 
ence, and that may need to accommodate hidden vari- 
ables, and so concentrates on likelihood and marginal- 
based losses. 

5 Implicit Fitting 

We now turn to the issue of how to train high-treewidth 
graphical models to optimize the performance of a 
marginal-based loss function, based on some approxi- 
mate inference algorithm. Now, computing the value of 
the loss for any of the marginal-based loss functions is 
not hard. One can simply run the inference algorithm 
and plug the resulting marginal into the loss. However, 
we also require the gradient $dL/d\theta$.

Our first result is that the loss gradient can be obtained 
by solving a sparse linear system. Here, it is useful to 
introduce notation to distinguish the loss L, defined in
terms of the parameters $\theta$, from the loss Q, defined directly
in terms of the marginals $\mu$. (Note that though the
notation suggests the application to marginal inference, 
this is a generic result.) 

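Theorem. Suppose that

$$\mu(\theta) := \arg\max_{\mu : B\mu = d} \theta \cdot \mu + H(\mu). \tag{22}$$

Define $L(\theta, x) = Q(\mu(\theta), x)$. Then, letting $D = \frac{d^2 H}{d\mu\, d\mu^T}$,

$$\frac{dL}{d\theta} = \big( D^{-1} B^T (B D^{-1} B^T)^{-1} B D^{-1} - D^{-1} \big) \frac{dQ}{d\mu}.$$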

A proof may be found in Appendix B. This theorem
states that, essentially, once one has computed the predicted
marginals, the gradient of the loss with respect
to marginals $dQ/d\mu$ can be transformed into the gradient
of the loss with respect to parameters $dL/d\theta$ through the
solution of a sparse linear system.
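
A small sanity check of this formula is possible in the simplest constrained case. The sketch below is illustrative only, with hypothetical function names: it assumes a single variable with K states, H the Shannon entropy, and a single linear constraint requiring the marginals to sum to one (B a row of ones). In that case the maximizer is the softmax of the parameters, and for the loss $Q = -\log \mu(x)$ the gradient is known in closed form to be $\mu - e_x$, where $e_x$ is the indicator vector of the observed state.

```python
import numpy as np

def softmax(theta):
    """Maximizer of theta . mu + H(mu) over the probability simplex."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def implicit_loss_gradient(mu, B, dQ_dmu):
    """(D^-1 B^T (B D^-1 B^T)^-1 B D^-1 - D^-1) dQ/dmu for the entropy Hessian D."""
    # H(mu) = -sum mu log mu has Hessian D = diag(-1/mu), so D^-1 = diag(-mu).
    D_inv = np.diag(-mu)
    M = B @ D_inv @ B.T
    correction = D_inv @ B.T @ np.linalg.solve(M, B @ D_inv)
    return (correction - D_inv) @ dQ_dmu
```

For this toy case the matrix in parentheses reduces to $\mathrm{diag}(\mu) - \mu\mu^T$, the familiar softmax Jacobian. A dense solve is used here for brevity; larger problems would exploit sparsity.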

The optimization in Eq. 22 takes place under linear
constraints, which encompasses the local polytope used
in TRW message-passing (Eq. 15). This theorem does not
apply to mean field, as $\mathcal{F}$ is not a linear constraint set
when viewed as a function of both clique and univariate 
marginals. 

In any case, the methods developed below are simpler 
to use, as they do not require explicitly forming the 
constraint matrix B or solving the linear system.

Figure 2: Exact and predicted marginals for an example input. Predicted marginals are trained using 1000 examples.
With low shifts, all loss functions lead to accurate predicted marginals. However, the univariate and clique logistic
losses are more resistant to the effects of model mis-specification. Legends as in Fig. 1.



6 Perturbation 

This section observes that variational methods have a 
special structure that allows derivatives to be calculated 
without explicitly forming or inverting a linear system. 
We have, by the vector chain rule, that 



$$\frac{dL}{d\theta} = \frac{d\mu^T}{d\theta} \frac{dQ}{d\mu}. \qquad (23)$$



A classic trick in scientific computing is to efficiently 
compute Jacobian-vector products by finite differences. 
The basic result is that, for any vector v, 

$$\frac{d\mu}{d\theta^T} v = \lim_{r \to 0} \frac{1}{r} \big( \mu(\theta + r v) - \mu(\theta) \big),$$

which is essentially just the definition of the derivative of
$\mu$ in the direction of $v$. Now, this does not immediately
seem helpful, since Eq. 23 requires $d\mu^T/d\theta$, not $d\mu/d\theta^T$.
However, with variational methods, these are symmetric. The
simplest way to see this is to note that



$$\frac{d\mu}{d\theta^T} = \frac{d}{d\theta^T} \left( \frac{dA}{d\theta} \right) = \frac{d^2 A}{d\theta \, d\theta^T}.$$



Domke [46] lists conditions for various classes of entropies
that guarantee that $A$ will be differentiable.

Combining the above three equations, the loss gradi- 
ent is available as the limit 



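$$\frac{dL}{d\theta} = \lim_{r \to 0} \frac{1}{r} \left( \mu\Big(\theta + r \frac{dQ}{d\mu}\Big) - \mu(\theta) \right). \qquad (24)$$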



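In practice, of course, the gradient is approximated
using some finite $r$. The simplest approximation, one-sided
differences, simply takes a single value of $r$ in
Eq. 24 rather than a limit. More accurate results, at the
cost of more calls to inference, are given using two-sided
differences, with

$$\frac{dL}{d\theta} \approx \frac{1}{2r} \left( \mu\Big(\theta + r \frac{dQ}{d\mu}\Big) - \mu\Big(\theta - r \frac{dQ}{d\mu}\Big) \right),$$

which is accurate to order $o(r^2)$. Still more accurate
results are obtained with "four-sided" differences, with

$$\frac{dL}{d\theta} \approx \frac{1}{12r} \left( -\mu\Big(\theta + 2r \frac{dQ}{d\mu}\Big) + 8\mu\Big(\theta + r \frac{dQ}{d\mu}\Big) - 8\mu\Big(\theta - r \frac{dQ}{d\mu}\Big) + \mu\Big(\theta - 2r \frac{dQ}{d\mu}\Big) \right),$$

which is accurate to order $o(r^4)$ [47].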

Alg. 1 shows more explicitly how the loss gradient
could be calculated, using two-sided differences.

The issue remains of how to calculate the step size
$r$. Each of the approximations above becomes exact
as $r \to 0$. However, as $r$ becomes very small, numerical
error eventually dominates. To investigate this
issue experimentally, we generated random models on a
$10 \times 10$ binary grid, with each parameter $\theta(x_i)$ randomly
chosen from a standard normal, while each interaction
parameter $\theta(x_i, x_j)$ was chosen randomly from a normal
with a standard deviation of $s$. In each case, a random
value $x$ was generated, and the "true" loss gradient
was estimated by standard (inefficient) 2-sided finite
differences, with inference re-run after each component
of $\theta$ is perturbed independently. To this, we compare
one-, two-, and four-sided perturbations. In all cases,
the step size is, following Andrei [48], taken to be
$r = m \epsilon^{1/3} (1 + \|\theta\|_\infty) / \|dQ/d\mu\|_\infty$, where $\epsilon$ is machine epsilon,
and $m$ is a multiplier that we will vary. Note that the
optimal power of $\epsilon$ will depend on the finite difference
scheme; $1/3$ is optimal for two-sided differences [49, Sec.
8.1]. All calculations take place in double-precision with
Algorithm 1 Calculating $dL/d\theta$ by perturbation (two-sided).

1) Do inference: $\mu^* \leftarrow \arg\max_{\mu} \; \theta \cdot \mu + H(\mu)$.

2) At $\mu^*$, calculate the gradient $dQ/d\mu$.

3) Calculate a perturbation size $r$.

4) Do inference on perturbed parameters.

a) $\mu^+ \leftarrow \arg\max_{\mu} \; (\theta + r \, dQ/d\mu) \cdot \mu + H(\mu)$

b) $\mu^- \leftarrow \arg\max_{\mu} \; (\theta - r \, dQ/d\mu) \cdot \mu + H(\mu)$

5) Recover full derivative as $\frac{dL}{d\theta} \leftarrow \frac{1}{2r} (\mu^+ - \mu^-)$.







Figure 3: An evaluation of perturbation multipliers $m$.
Top: TRW. Bottom: Mean field. Two effects are in play
here. First, for too small a perturbation, numerical errors
dominate. Meanwhile, for too large a perturbation,
approximation errors dominate. We see that using 2- or
4-sided differences reduces approximation error,
leading to better results with larger perturbations.



inference run until marginals changed by a threshold 
of less than 10~^^. Fig. 3 shows that using many-sided
differences leads to more accuracy, at the cost of needing 
to run inference more times to estimate a single loss 
gradient. In the following experiments, we chose two- 
sided differences with a multiplier of 1 as a reasonable 
tradeoff between accuracy, simplicity, and computational 
expense. 
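The two-sided perturbation recipe can be sketched in a few lines. This is only an illustration under simplifying assumptions: "inference" here is exact inference for a single discrete variable (a softmax), the loss is $Q(\mu, x) = -\log \mu(x)$, and the function names are hypothetical. For this toy case the exact gradient is known in closed form ($\mu - e_x$), which makes the approximation easy to check.

```python
import numpy as np

def infer(theta):
    """Stand-in for inference: argmax_mu theta.mu + H(mu) for one variable."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_Q(mu, x):
    """Gradient of Q(mu, x) = -log mu[x] with respect to mu."""
    g = np.zeros_like(mu)
    g[x] = -1.0 / mu[x]
    return g

def perturbation_gradient(theta, x, m=1.0):
    """Two-sided perturbation estimate of dL/dtheta (three inference calls)."""
    mu = infer(theta)                       # step 1: inference
    g = grad_Q(mu, x)                       # step 2: dQ/dmu at mu*
    eps = np.finfo(float).eps               # step 3: step size r
    r = m * eps ** (1.0 / 3.0) * (1.0 + np.abs(theta).max()) / np.abs(g).max()
    mu_plus = infer(theta + r * g)          # step 4: perturbed inference
    mu_minus = infer(theta - r * g)
    return (mu_plus - mu_minus) / (2.0 * r) # step 5: recover the gradient

theta = np.array([0.3, -1.2, 0.8, 0.1])
x = 1
approx = perturbation_gradient(theta, x)
exact = infer(theta) - np.eye(theta.size)[x]
print(np.abs(approx - exact).max())
```

Note that the estimate never forms a Jacobian: the cost is three inference calls regardless of the number of parameters.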

Welling and Teh used sensitivity of approximate beliefs
to parameters to approximate joint probabilities of
non-neighboring variables [50].

7 Truncated Fitting 

The previous methods for computing loss gradients 
are derived under the assumption that the inference 
optimization is solved exactly. In an implementation, of 
course, some convergence threshold must be used. 

Different convergence thresholds can be used in the 
learning stage and at test time. In practice, we have 
observed that too loose a threshold in the learning stage 
can lead to a bad estimated risk gradient, and learning 



terminating with a bad search direction. Meanwhile, a 
loose threshold can often be used at test time with few 
consequences. Usually, a difference of 10~^ in estimated 
marginals has little practical impact, but this can still be 
enough to prevent learning from succeeding [51]. 

It seems odd that the learning algorithm would spend 
the majority of computational effort exploring tight 
convergence levels that are irrelevant to the practical 
performance of the model. Here, we define the learning 
objective in terms of the approximate marginals obtained 
after a fixed number of iterations. To understand this, 
one may think of the inference process not as an op- 
timization, but rather as a large, nonlinear function. 
This clearly leads to a well-defined objective function. 
Inputting parameters, applying the iterations of either 
TRW or mean field, computing predicted marginals, 
and finally a loss are all differentiable operations. Thus, 
the loss gradient is efficiently computable, at least in
principle, by reverse-mode automatic differentiation
(autodiff), an approach explored by Stoyanov et al. [36],
[52]. In preliminary work, we experimented with autodiff
tools, but found these to be unsatisfactory for our
applications for two reasons. Firstly, these tools impose a
computational penalty over manually derived gradients. 
Secondly, autodiff stores all intermediate calculations, 
leading to large memory requirements. The methods 
derived below use less memory, both in terms of con- 
stant factors and big-O complexity. Nevertheless, some 
of these problems are issues with current implementations 
of reverse-mode autodiff, avoidable in theory. 

Both mean field and TRW involve steps where we first 
take a product of a set of terms, and then normalize. We 
define a "backnorm'' operator, which is useful in taking 
derivatives over such operations, by 

$$\mathrm{backnorm}(g, c) = c \odot (g - g \cdot c).$$

This will be used in the algorithms here. More discussion 
on this point can be found in Appendix C. 
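One way to read this operator, offered here as an interpretation rather than as the paper's own derivation: if $c$ is obtained by exponentiating a vector $a$ and normalizing, and $g = dL/dc$, then $\mathrm{backnorm}(g, c)$ is $dL/da$. The sketch below checks this numerically with central differences; `normalize_exp` is a hypothetical helper name.

```python
import numpy as np

def backnorm(g, c):
    """c * (g - g.c), with * taken elementwise."""
    return c * (g - g @ c)

def normalize_exp(a):
    """Exponentiate and normalize (the 'product, then normalize' step in log form)."""
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
a = rng.normal(size=5)
g = rng.normal(size=5)          # stand-in for an upstream gradient dL/dc
c = normalize_exp(a)

analytic = backnorm(g, c)

# Central finite differences of the scalar function a -> g . normalize_exp(a).
h = 1e-6
numeric = np.array([
    (g @ normalize_exp(a + h * e) - g @ normalize_exp(a - h * e)) / (2 * h)
    for e in np.eye(a.size)
])
print(np.abs(analytic - numeric).max())
```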



7.1 Back Mean Field 

The first backpropagating inference algorithm, back
mean field, is shown as Alg. 2. The idea is as follows:
Suppose we start with uniform marginals, run $N$ iterations
of mean field, and then, regardless of whether mean field
has converged or not, take predicted marginals and plug
them into one of the marginal-based loss functions. Since
each step in this process is differentiable, this specifies
the loss as a differentiable function of model parameters.
We want the exact gradient of this function.

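Theorem. After execution of back mean field,

$$\overleftarrow{\theta}(x_i) = \frac{dL}{d\theta(x_i)} \quad \text{and} \quad \overleftarrow{\theta}(x_c) = \frac{dL}{d\theta(x_c)}.$$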



A proof sketch is in Appendix C. Roughly speaking, 
the proof takes the form of a mechanical differentiation 
of each step of the inference process. 
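Algorithm 2 Back Mean Field

1) Initialize $\mu$ uniformly.

2) Repeat $N$ times for all $j$:

a) Push the marginals $\mu_j$ onto a stack.

b) $\mu(x_j) \propto \exp\big( \theta(x_j) + \sum_{c : j \in c} \sum_{x_{c \setminus j}} \theta(x_c) \prod_{i \in c \setminus j} \mu(x_i) \big)$

3) Compute $L$, $\overleftarrow{\mu}(x_j) = \frac{dL}{d\mu(x_j)}$ and $\overleftarrow{\mu}(x_c) = \frac{dL}{d\mu(x_c)}$.

4) Initialize $\overleftarrow{\theta}(x_i) \leftarrow 0$, $\overleftarrow{\theta}(x_c) \leftarrow 0$.

5) Repeat $N$ times for all $j$ (in reverse order):

a) $\overleftarrow{\nu}_j \leftarrow \mathrm{backnorm}(\overleftarrow{\mu}_j, \mu_j)$

b) $\overleftarrow{\theta}(x_j) \leftarrow \overleftarrow{\theta}(x_j) + \overleftarrow{\nu}(x_j)$

c) $\overleftarrow{\theta}(x_c) \leftarrow \overleftarrow{\theta}(x_c) + \overleftarrow{\nu}(x_j) \prod_{i \in c \setminus j} \mu(x_i) \quad \forall c : j \in c$

d) $\overleftarrow{\mu}(x_i) \leftarrow \overleftarrow{\mu}(x_i) + \sum_{x_{c \setminus i}} \overleftarrow{\nu}(x_j) \, \theta(x_c) \prod_{k \in c \setminus \{i, j\}} \mu(x_k) \quad \forall c : j \in c, \; \forall i \in c \setminus j$

e) Pull marginals $\mu_j$ from the stack.

f) $\overleftarrow{\mu}_j(x_j) \leftarrow 0$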



Note that, as written, back mean field only produces
univariate marginals, and so cannot cope with loss functions
making use of clique marginals. However, with
mean field, the clique marginals are simply the product
of univariate marginals: $\mu(x_c) = \prod_{i \in c} \mu(x_i)$. Hence, any
loss defined on clique marginals can equivalently be
defined on univariate marginals.
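The following is a minimal sketch of truncated mean field with backpropagation for the pairwise case only, written to mirror the push/pull structure of the algorithm. It is not the author's implementation: the data layout (`theta_u` of shape nodes × states, `theta_e` of shape edges × states × states, an `edges` list of node pairs), the univariate logistic loss $-\sum_i \log \mu(x_i)$, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def backnorm(g, c):
    return c * (g - g @ c)

def neighbours(edges, j):
    """Yield (edge index, neighbour i, flip) for every edge touching node j."""
    for e, (a, b) in enumerate(edges):
        if a == j:
            yield e, b, False
        elif b == j:
            yield e, a, True

def loss_and_grad(theta_u, theta_e, edges, x, n_iters):
    n, K = theta_u.shape
    mu = np.full((n, K), 1.0 / K)            # uniform initial marginals
    stack = []
    for _ in range(n_iters):                 # forward pass: N sweeps of mean field
        for j in range(n):
            stack.append(mu[j].copy())       # push old marginal
            a = theta_u[j].copy()
            for e, i, flip in neighbours(edges, j):
                T = theta_e[e].T if flip else theta_e[e]   # T[x_j, x_i]
                a += T @ mu[i]
            mu[j] = softmax(a)
    idx = np.arange(n)
    loss = -np.log(mu[idx, x]).sum()         # univariate logistic loss
    g_mu = np.zeros_like(mu)
    g_mu[idx, x] = -1.0 / mu[idx, x]         # dL/dmu
    g_u = np.zeros_like(theta_u)
    g_e = np.zeros_like(theta_e)
    for _ in range(n_iters):                 # backward pass, reverse order
        for j in reversed(range(n)):
            nu = backnorm(g_mu[j], mu[j])
            g_u[j] += nu
            for e, i, flip in neighbours(edges, j):
                T = theta_e[e].T if flip else theta_e[e]
                outer = np.outer(nu, mu[i])  # indexed [x_j, x_i]
                g_e[e] += outer.T if flip else outer
                g_mu[i] += T.T @ nu
            mu[j] = stack.pop()              # pull old marginal
            g_mu[j] = 0.0
    return loss, g_u, g_e

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # a 4-cycle, purely illustrative
theta_u = 0.5 * rng.normal(size=(4, 3))
theta_e = 0.5 * rng.normal(size=(4, 3, 3))
x = np.array([0, 2, 1, 1])
loss, g_u, g_e = loss_and_grad(theta_u, theta_e, edges, x, n_iters=5)
print(loss)
```

Because the stack restores each marginal before its update is differentiated, memory grows with the number of updates but no computation graph needs to be stored; the result can be compared against finite differences of `loss` to confirm that it is the exact gradient of the truncated objective.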

7.2 Back TRW 

Next, we consider truncated fitting with TRW inference. 
As above, we will assume that some fixed number N 
of inference iterations have been run, and we want to 
define and differentiate a loss defined on the current 
predicted marginals. Alg. 3 shows the method.



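Theorem. After execution of back TRW,

$$\overleftarrow{\theta}(x_i) = \frac{dL}{d\theta(x_i)} \quad \text{and} \quad \overleftarrow{\theta}(x_c) = \frac{dL}{d\theta(x_c)}.$$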



Again, a proof sketch is in Appendix C. 

If one uses pairwise factors only, uniform appearance
probabilities of $\rho = 1$, removes all reference to the stack,
and uses a convergence threshold in place of a fixed
number of iterations, one obtains essentially Eaton and
Ghahramani's back belief propagation [53, extended
version, Fig. 5]. Here, we refer to the general strategy
of using full (non-truncated) inference as "backpropagation",
either with LBP, TRW, or mean field.

7.3 Truncated Likelihood & Truncated EM 

Applying the truncated fitting strategies to any of the 
marginal-based loss functions is simple. Applying it to 
the likelihood or EM loss, however, is not so straightforward.
The reason is that these losses (Eqs. 19 and 21) are



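Algorithm 3 Back TRW.

1) Initialize $m$ uniformly.

2) Repeat $N$ times for all pairs $(c, i)$, with $i \in c$:

a) Push the messages $m_c(x_i)$ onto a stack.

b) $m_c(x_i) \propto \sum_{x_{c \setminus i}} e^{\frac{1}{\rho_c} \theta(x_c)} \prod_{j \in c \setminus i} e^{\theta(x_j)} \frac{\prod_{d : j \in d} m_d(x_j)^{\rho_d}}{m_c(x_j)}$

3) $\mu(x_c) \propto e^{\frac{1}{\rho_c} \theta(x_c)} \prod_{i \in c} e^{\theta(x_i)} \frac{\prod_{d : i \in d} m_d(x_i)^{\rho_d}}{m_c(x_i)} \quad \forall c$

4) $\mu(x_i) \propto e^{\theta(x_i)} \prod_{d : i \in d} m_d(x_i)^{\rho_d} \quad \forall i$

5) Compute $L$, $\overleftarrow{\mu}(x_i) = \frac{dL}{d\mu(x_i)}$ and $\overleftarrow{\mu}(x_c) = \frac{dL}{d\mu(x_c)}$.

6) For all $c$,

a) $\overleftarrow{\nu}(x_c) \leftarrow \mathrm{backnorm}(\overleftarrow{\mu}_c, \mu_c)$

b) $\overleftarrow{\theta}(x_c) \leftarrow \overleftarrow{\theta}(x_c) + \frac{1}{\rho_c} \overleftarrow{\nu}(x_c)$

c) $\overleftarrow{\theta}(x_i) \leftarrow \overleftarrow{\theta}(x_i) + \sum_{x_{c \setminus i}} \overleftarrow{\nu}(x_c) \quad \forall i \in c$

d) $\overleftarrow{m}_d(x_i) \leftarrow \overleftarrow{m}_d(x_i) + \frac{\rho_d - I[c = d]}{m_d(x_i)} \sum_{x_{c \setminus i}} \overleftarrow{\nu}(x_c) \quad \forall i \in c, \; \forall d : i \in d$

7) For all $i$,

a) $\overleftarrow{\nu}(x_i) \leftarrow \mathrm{backnorm}(\overleftarrow{\mu}_i, \mu_i)$

b) $\overleftarrow{\theta}(x_i) \leftarrow \overleftarrow{\theta}(x_i) + \overleftarrow{\nu}(x_i)$

c) $\overleftarrow{m}_d(x_i) \leftarrow \overleftarrow{m}_d(x_i) + \rho_d \frac{\overleftarrow{\nu}(x_i)}{m_d(x_i)} \quad \forall d : i \in d$

8) Repeat $N$ times for all pairs $(c, i)$ (in reverse order):

a) $s(x_c) \leftarrow e^{\frac{1}{\rho_c} \theta(x_c)} \prod_{j \in c \setminus i} e^{\theta(x_j)} \frac{\prod_{d : j \in d} m_d(x_j)^{\rho_d}}{m_c(x_j)}$

b) $\overleftarrow{\nu}(x_i) \leftarrow \mathrm{backnorm}(\overleftarrow{m}_{ci}, m_{ci})$

c) $\overleftarrow{\theta}(x_c) \leftarrow \overleftarrow{\theta}(x_c) + \frac{1}{\rho_c} s(x_c) \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)}$

d) $\overleftarrow{\theta}(x_j) \leftarrow \overleftarrow{\theta}(x_j) + \sum_{x_{c \setminus j}} s(x_c) \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)} \quad \forall j \in c \setminus i$

e) $\overleftarrow{m}_d(x_j) \leftarrow \overleftarrow{m}_d(x_j) + \frac{\rho_d - I[c = d]}{m_d(x_j)} \sum_{x_{c \setminus j}} s(x_c) \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)} \quad \forall j \in c \setminus i, \; \forall d : j \in d$

f) Pull messages $m_c(x_i)$ from the stack.

g) $\overleftarrow{m}_c(x_i) \leftarrow 0$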



defined, not in terms of predicted marginals, but in terms 
of partition functions. Nevertheless, we wish to compare 
to these losses in the experiments below. As we found 
truncation to be critical for speed, we instead derive a 
variant of truncated fitting. 

The basic idea is to define a "truncated partition function".
This is done by taking the predicted marginals,
obtained after a fixed number of iterations, and plugging
them into the entropy approximations used either for
mean field (Eq. 12) or TRW (Eq. 16). The approximate
entropy $\tilde{H}$ is then used in defining a truncated partition
function as

$$\tilde{A}(\theta) = \theta \cdot \mu(\theta) + \tilde{H}(\mu(\theta)).$$

As we will see below, with too few inference iterations,
using this approximation can cause the surrogate likelihood
to diverge. To see why, imagine an extreme case
where zero inference iterations are used. This results in
the loss $L(\theta, x) = -\theta \cdot (f(x) - \mu^0) + \tilde{H}(\mu^0)$, where $\mu^0$ are the
initial marginals. As long as the mean of $f(x)$ over the
dataset is not equal to $\mu^0$, an arbitrarily low loss can be achieved.
With hidden variables, $\tilde{A}(\theta, z)$ is defined similarly, but
with the variables $z$ "clamped" to the observed values.
(Those variables will play no role in determining $\mu(\theta)$.)
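The zero-iteration argument can be made concrete with a small numerical sketch. The setup is an assumption chosen for simplicity: one variable with four states, uniform initial marginals, and a hypothetical dataset mean of $f(x)$ that differs from them. Scaling $\theta$ along the direction (mean of $f(x)$ minus $\mu^0$) makes the loss decrease linearly without bound.

```python
import numpy as np

def zero_iteration_loss(theta, f_mean, mu0):
    """L = -theta . (f - mu0) + H(mu0), with mu0 fixed (no inference iterations)."""
    entropy = -np.sum(mu0 * np.log(mu0))
    return -theta @ (f_mean - mu0) + entropy

mu0 = np.full(4, 0.25)                       # uniform initial marginals
f_mean = np.array([0.7, 0.1, 0.1, 0.1])      # hypothetical dataset mean of f(x)
direction = f_mean - mu0

scales = (0.0, 1.0, 10.0, 100.0)
losses = [zero_iteration_loss(s * direction, f_mean, mu0) for s in scales]
for s, value in zip(scales, losses):
    print(s, value)
```

Since the marginals do not respond to $\theta$ at all in this extreme case, nothing counteracts the linear term, which is the mechanism behind the divergence described above.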

8 Experiments 

These experiments consider three different datasets with 
varying complexity. In all cases, we try to keep the 
features used relatively simple. This means some sac- 
rifice in performance, relative to using sophisticated 
features tuned more carefully to the different problem 
domains. However, given that our goal here is to gauge 
the relative performance of the different algorithms, 
we use simple features for the sake of experimental 
transparency. 

We compare marginal-based learning methods to
the surrogate likelihood/EM, the pseudolikelihood and
piecewise likelihood. These comparisons were chosen
because, first, they are the most popular in the literature
(Sec. 4). Second, the surrogate likelihood also requires
marginal inference, meaning an "apples to apples"
comparison using the same inference method. Third, these
methods can all cope with hidden variables, which
appear in our third dataset.

In each experiment, an "independent" model, trained
using univariate features only with logistic loss, was
used to initialize the others. The smoothed classification loss,
because of more severe issues with local minima, was
initialized using surrogate likelihood/EM.

8.1 Setup 

All experiments here will be on vision problems, using
a pairwise, 4-connected grid. Learning uses the
L-BFGS optimization algorithm. The parameter values $\theta$ are linearly
parametrized in terms of unary and edge features.
Formally, we will fit two matrices, $F$ and $G$, which
determine all unary and edge features, respectively. These can
be expressed most elegantly by introducing a bit more
notation. Let $\theta_i$ denote the set of parameter values $\theta(x_i)$
for all values $x_i$. If $u(y, i)$ denotes the vector of unary
features for variable $i$ given input image $y$, then

$$\theta_i = F u(y, i).$$

Similarly, let $\theta_{ij}$ denote the set of parameter values
$\theta(x_i, x_j)$ for all $x_i, x_j$. If $v(y, i, j)$ is the vector of edge
features for pair $(i, j)$, then



$$\theta_{ij} = G v(y, i, j).$$



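Once $dL/d\theta$ has been calculated (for whichever loss and
method), we can easily recover the gradients with respect
to $F$ and $G$ by

$$\frac{dL}{dF} = \sum_i \frac{dL}{d\theta_i} u(y, i)^T, \qquad \frac{dL}{dG} = \sum_{ij} \frac{dL}{d\theta_{ij}} v(y, i, j)^T.$$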
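In matrix form, these sums of outer products are single matrix products. The sketch below assumes a particular layout that is not specified in the text: the per-node gradients $dL/d\theta_i$ are stacked as rows of one array, unary features $u(y, i)$ as rows of `U`, and likewise for edges with `V`.

```python
import numpy as np

def feature_gradients(dL_dtheta_unary, U, dL_dtheta_edge, V):
    """Chain rule through theta_i = F u(y, i) and theta_ij = G v(y, i, j).

    dL_dtheta_unary: (n_nodes, n_states)        rows are dL/dtheta_i
    U:               (n_nodes, n_unary_feats)   rows are u(y, i)
    dL_dtheta_edge:  (n_edges, n_states ** 2)   rows are dL/dtheta_ij (flattened)
    V:               (n_edges, n_edge_feats)    rows are v(y, i, j)
    """
    dL_dF = dL_dtheta_unary.T @ U    # sum_i  (dL/dtheta_i)  u(y, i)^T
    dL_dG = dL_dtheta_edge.T @ V     # sum_ij (dL/dtheta_ij) v(y, i, j)^T
    return dL_dF, dL_dG

# Illustrative shapes only: 6 nodes, 7 edges, 2 states, 3 unary and 2 edge features.
rng = np.random.default_rng(1)
U = rng.normal(size=(6, 3))
V = rng.normal(size=(7, 2))
F = rng.normal(size=(2, 3))
G = rng.normal(size=(4, 2))

theta_unary = U @ F.T                # row i is theta_i = F u(y, i)
theta_edge = V @ G.T                 # row ij is theta_ij = G v(y, i, j)

# Toy loss L = 0.5 * ||theta||^2, for which dL/dtheta = theta.
dL_dF, dL_dG = feature_gradients(theta_unary, U, theta_edge, V)
print(dL_dF.shape, dL_dG.shape)
```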
Table 1: Binary denoising error rates for different noise
levels n. All methods use TRW inference with back- 
propagation based learning with a threshold of 10~^. 
Figure 4: Predicted marginals for an example binary
denoising test image with different noise levels n. 



8.2 Binary Denoising Data 

In a first experiment, we create a binary denoising
problem using the Berkeley Segmentation Dataset. Here,
we took 132 images of size $200 \times 300$ from the Berkeley
segmentation dataset and binarized them according to whether each
pixel is above the image mean. The noisy input values
are then generated as $y_i = x_i (1 - t_i^n) + (1 - x_i) t_i^n$, where
$x_i$ is the true binary label, and $t_i \in [0, 1]$ is random.
Here, $n \in (1, \infty)$ is the noise level, where lower values
correspond to more noise. Thirty-two images were used
for training, and 100 for testing. This is something of a
Table 2: Training and test errors on the horses dataset,
using either TRW or mean-field (MNF) inference. With
too few iterations, the surrogate likelihood diverges.



toy problem, but the ability to systematically vary the 
noise level is illustrative. 
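The noise model can be sketched as follows. The text says only that $t_i \in [0, 1]$ is random; drawing it uniformly is an assumption made here, as are the function name and the toy image.

```python
import numpy as np

def make_noisy(x, n, rng):
    """y_i = x_i (1 - t_i^n) + (1 - x_i) t_i^n for binary labels x."""
    t = rng.uniform(0.0, 1.0, size=x.shape)   # assumption: t_i ~ Uniform[0, 1]
    return x * (1.0 - t ** n) + (1.0 - x) * t ** n

rng = np.random.default_rng(0)
x = (rng.uniform(size=(20, 30)) > 0.5).astype(float)   # toy binary "image"

y_noisier = make_noisy(x, n=1.25, rng=np.random.default_rng(1))
y_cleaner = make_noisy(x, n=5.0, rng=np.random.default_rng(1))
print(np.abs(y_noisier - x).mean(), np.abs(y_cleaner - x).mean())
```

Since $|y_i - x_i| = t_i^n$ in either case, larger $n$ pushes the noisy value toward the true label, consistent with lower $n$ meaning more noise.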

As unary features u(y, i), we use only two features: a 
constant of 1, and the noisy input value at that pixel. 

For edge features $v(y, i, j)$, we also use two features:
one indicating that $(i, j)$ is a horizontal edge, and one
indicating that $(i, j)$ is a vertical edge. The effect is
that vertical and horizontal edges have independent
parameters.

For learning, we use full back TRW and back 
mean field (without message-storing or truncation) for 
marginal-based loss functions, and the surrogate likelihood
with the gradient computed in the direct form (Eq. 20).
In all cases, a threshold on inference of 10~^ is used.

Error rates are shown in Tab. 1, while predicted
marginals for an example test image are shown in Fig.
4. We compare against an independent model, which
can be seen as truncated fitting with zero iterations or, 
equivalently, logistic regression at the pixel level. We see 
that for low noise levels, all methods perform well, while 
for high noise levels, the marginal-based losses outper- 
form the surrogate likelihood and pseudolikelihood by 
a considerable margin. Our interpretation of this is that 
model mis-specification is more pronounced with high 
noise, and other losses are less robust to this. 

8.3 Horses 

Secondly, we use the Weizmann horse dataset, consisting
of 328 images of horses at various resolutions. We use
200 for training and 128 for testing. The set of possible
labels $x_i$ is again binary: either the pixel is part of a
horse or not.

For unary features u(y, i), we begin by computing the 
RGB intensities of each pixel, along with the normalized 
vertical and horizontal positions. We expand these 5 initial
features into a larger set using sinusoidal expansion
[54]. Specifically, denote the five original features by $s$.
Then, we include the features $\sin(c \cdot s)$ and $\cos(c \cdot s)$ for all
binary vectors $c$ of the appropriate length. This results
in an expanded set of 64 features. To these we append
Figure 5: Predicted marginals for a test image from the
horses dataset. Truncated learning uses 40 iterations.

a 36-component Histogram of Gradients [55], for a total
of 100 features. 
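The sinusoidal expansion can be sketched directly from its description: enumerate all binary vectors $c$ of length five and take the sine and cosine of $c \cdot s$. The function name and the example input are illustrative.

```python
import itertools
import numpy as np

def sinusoidal_expansion(s):
    """Return sin(c . s) and cos(c . s) for every binary vector c of len(s)."""
    s = np.asarray(s, dtype=float)
    C = np.array(list(itertools.product([0, 1], repeat=s.size)))  # shape (2^d, d)
    z = C @ s
    return np.concatenate([np.sin(z), np.cos(z)])

# Five base features: R, G, B intensities and normalized vertical/horizontal position.
s = np.array([0.2, 0.5, 0.9, 0.25, 0.75])   # hypothetical values for one pixel
features = sinusoidal_expansion(s)
print(features.shape)
```

With five inputs there are $2^5 = 32$ binary vectors, giving the 64 features quoted above. The all-zeros vector contributes $\sin(0) = 0$ and $\cos(0) = 1$, so the expansion includes a constant feature.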

For edge features between $i$ and $j$, we use a set of
21 "base" features: a constant of one, the $\ell_2$ norm of
the difference of the RGB values at $i$ and $j$, discretized
using 10 thresholds, and the maximum response of
a Sobel edge filter at $i$ or $j$, again discretized using 10
thresholds. To generate the final feature vector $v(y, i, j)$,
this is increased into a set of 42 features. If $(i, j)$ is
a horizontal edge, the first half of these will contain
the base features, and the other half will be zero. If
$(i, j)$ is a vertical edge, the opposite situation occurs.
This essentially allows for different parametrization of
vertical and horizontal edges.
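The horizontal/vertical doubling amounts to placing the base features in one of two halves of a longer vector. A minimal sketch, with a hypothetical function name and random stand-ins for the 21 base features:

```python
import numpy as np

def edge_features(base, horizontal):
    """Stack base features into the first half (horizontal) or second half (vertical)."""
    base = np.asarray(base, dtype=float)
    zeros = np.zeros_like(base)
    parts = [base, zeros] if horizontal else [zeros, base]
    return np.concatenate(parts)

base = np.random.default_rng(0).uniform(size=21)   # stand-in for the 21 base features
v_horizontal = edge_features(base, horizontal=True)
v_vertical = edge_features(base, horizontal=False)
print(v_horizontal.shape, v_vertical.shape)
```

Because $\theta_{ij} = G v(y, i, j)$ is linear, the first 21 columns of $G$ are then used only by horizontal edges and the last 21 only by vertical edges.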

In a first experiment, we train models with truncated
fitting with various numbers of iterations. The
pseudolikelihood and piecewise likelihood use a convergence
threshold of 10~^ for testing. Several trends are visible
in Tab. 2. First, with fewer than 20 iterations, the
truncated surrogate likelihood diverges, and produces errors
around 0.4. Second, TRW consistently outperforms mean
field. Finally, marginal-based loss functions outperform
the others, both in terms of training and test errors.
Fig. 5 shows predicted marginals for an example test
image. On this dataset, the pseudolikelihood, piecewise
likelihood, and surrogate likelihood based on mean field
are outperformed by an independent model, where each
label is predicted by input features independent of all
others.

8.4 Stanford Backgrounds Data 

Our final experiments consider the Stanford backgrounds
dataset. This consists of 715 images of resolution
approximately $240 \times 320$. Most pixels are labeled as
one of eight classes, with some unlabeled.

                                5 iters         10 iters        20 iters
Loss                  λ         Train   Test    Train   Test    Train   Test
surrogate EM          10^-3     .876    .877    .239    .249    .238    .248
univariate logistic   10^-3     .210    .228    .202    .224    .201    .223
clique logistic       10^-3     .206    .226    .198    .223    .195    .221
pseudolikelihood      10^-4     -       -       -       -       .516    .519
piecewise             10^-4     -       -       -       -       .335    .341
independent           10^-?     -       -       -       -       .293    .299

Table 3: Test errors on the backgrounds dataset using
TRW inference. With too few iterations, surrogate EM
diverges, leading to very high error rates.

Figure 6: Example marginals from the backgrounds
dataset using 20 iterations for truncated fitting.

The unary features u(y, i) we use here are identical to 
those for the horses dataset. In preliminary experiments, 
we tried training models with various resolutions. We 
found that reducing resolution to 20% of the original 
after computing features, then upsampling the predicted 
marginals yielded significantly better results than using 
the original resolution. Hence, this is done for all the 
following experiments. Edge features are identical to
those for the horses dataset, except only based on the
difference of RGB intensities, meaning 22 total edge
features $v(y, i, j)$.

In a first experiment, we compare the performance 
of truncated fitting, perturbation, and back-propagation, 
using 100 images from this dataset for speed. We train 
with varying thresholds for perturbation and back- 
propagation, while for truncated learning, we use vari- 
ous numbers of iterations. All models are trained with 
TRW to fit the univariate logistic loss. If a bad search- 
direction is encountered, L-BFGS is re-initialized. Results 
are shown in Fig. 7. We see that with loose thresholds,
perturbation and back-propagation experience learning 
failure at sub-optimal solutions. Truncated fitting is far 
Figure 7: Comparison of different learning methods on
the backgrounds dataset with 100 images. All use an
8-core 2.26 GHz PC.

more successful; using more iterations is slower to fit, 
but leads to better performance at convergence. 

In a second experiment, we train on the entire dataset, 
with errors estimated using five-fold cross validation. 
Here, an incremental procedure was used, where first a 
subset of 32 images was trained on, with 1000 learn- 
ing iterations. The number of images was repeatedly 
doubled, with the number of learning iterations halved. 
In practice this reduced training time substantially.
Results are shown in Fig. 6. These results use a ridge
regularization penalty of $\lambda$ on all parameters. (This is
relative to the empirical risk, as measured per pixel.)
For EM and marginal-based loss functions, we set this
as $\lambda = 10^{-3}$. We found in preliminary experiments that
using a smaller regularization constant caused truncated
EM to diverge even with 10 iterations. The
pseudolikelihood and piecewise likelihood benefit from less regularization,
and so we use $\lambda = 10^{-4}$ there. Again, the
marginal-based loss functions outperform the others. In particular,
they also perform quite well even with 5 iterations,
where truncated EM diverges.

9 Conclusions 

Training parameters of graphical models in a
high-treewidth setting involves several challenges. In this
paper, we focus on three: model mis-specification, the
necessity of approximate inference, and computational
complexity.

The main technical contribution of this paper is sev- 
eral methods for training based on the marginals pre- 
dicted by a given approximate inference algorithm. 
These methods take into account model mis-specification 
and inference approximations. To combat computational 
complexity, we introduce "truncated'' learning, where 
the inference algorithm only needs to be run for a fixed 
number of iterations. Truncation can also be applied, 
somewhat heuristically, to the surrogate likelihood. 

Among previous methods, we experimentally find the 
surrogate likelihood to outperform the pseudolikelihood 
or piecewise learning. By more closely reflecting the
test criterion of Hamming loss, marginal-based loss
functions perform still better, particularly on harder
problems (though the surrogate likelihood generally
displays smaller train/test gaps). Additionally,
marginal-based loss functions are more amenable to truncation, as
the surrogate likelihood diverges with too few iterations.






References 

[1] J. Besag, "Spatial interaction and the statistical analysis of lattice
systems," Journal of the Royal Statistical Society. Series B
(Methodological), vol. 36, no. 2, pp. 192-236, 1974.

[2] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random 
fields: Probabilistic models for segmenting and labeling sequence 
data," in ICML, 2001. 

[3] M. Nikolova, "Model distortions in bayesian MAP reconstruc- 
tion," Inverse Problems and Imaging, vol. 1, no. 2, pp. 399-422, 2007. 

[4] J. Marroquin, S. Mitter, and T. Poggio, "Probabilistic solution 
of ill-posed problems in computational vision," Journal of the 
American Statistical Association, vol. 82, no. 397, pp. 76-89, 1987. 

[5] S. S. Gross, O. Russakovsky, C. B. Do, and S. Batzoglou, "Training
conditional random fields for maximum labelwise accuracy," in
NIPS, 2007.

[6] S. Kumar, J. August, and M. Hebert, "Exploiting inference for 
approximate parameter learning in discriminative fields: An em- 
pirical study," in EMMCVPR, 2005. 

[7] P. Kohli and P. Torr, "Measuring uncertainty in graph cut solu- 
tions," Computer Vision and Image Understanding, vol. 112, no. 1, 
pp. 30-38, 2008. 

[8] M. Wainwright and M. Jordan, "Graphical models, exponential 
families, and variational inference," Found. Trends Mach. Learn., 
vol. 1, no. 1-2, pp. 1-305, 2008. 

[9] T. Meltzer, A. Globerson, and Y. Weiss, "Convergent message 
passing algorithms - a unifying view," in UAI, 2009. 

[10] S. Nowozin and C. H. Lampert, "Structured learning and pre- 
diction in computer vision," Foundations and Trends in Computer 
Graphics and Vision, vol. 6, pp. 185-365, 2011. 

[11] H. Cramer, Mathematical methods of statistics. Princeton University
Press, 1999.

[12] S. Nowozin, P. V. Gehler, and C. H. Lampert, "On parameter 
learning in CRF-based approaches to object class image segmen- 
tation," in ECCV, 2010. 

[13] L. Stewart, X. He, and R. S. Zemel, "Learning flexible features for
conditional random fields," IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 30, no. 8, pp. 1415-1426, 2008.

[14] C. Geyer, "Markov chain monte carlo maximum likelihood," in 
Symposium on the Interface, 1991. 

[15] M. Carreira-Perpinan and G. Hinton, "On contrastive divergence 
learning," in AISTATS, 2005. 

[16] S. Roth and M. J. Black, "Fields of experts," International Journal 
of Computer Vision, vol. 82, no. 2, pp. 205-229, 2009. 

[17] M. J. Wainwright, "Estimating the "wrong" graphical model: 
benefits in the computation-limited setting," Journal of Machine 
Learning Research, vol. 7, pp. 1829-1859, 2006. 

[18] J. J. Weinman, L. C. Tran, and C. J. Pal, "Efficiently learning 
random fields for stereo vision with sparse message passing," 
in ECCV, 2008, pp. 617-630. 

[19] T. Toyoda and O. Hasegawa, "Random field model for integration
of local information and global information," IEEE Trans. Pattern
Anal. Mach. Intell., vol. 30, no. 8, pp. 1483-1489, 2008.

[20] A. Levin and Y. Weiss, "Learning to combine bottom-up and 
top-down segmentation," International Journal of Computer Vision, 
vol. 81, no. 1, pp. 105-118, 2009. 

[21] S. Kumar, J. August, and M. Hebert, "Exploiting inference for 
approximate parameter learning in discriminative fields: An em- 
pirical study," in EMMCVPR, 2005. 

[22] X. Ren, C. Fowlkes, and J. Malik, "Figure/ground assignment in
natural images," in ECCV, 2006.

[23] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and
K. P. Murphy, "Accelerated training of conditional random fields
with stochastic gradient methods," in ICML, 2006.

[24] X. Ren, C. Fowlkes, and J. Malik, "Learning probabilistic models 
for contour completion in natural images," International Journal of 
Computer Vision, vol. 77, no. 1-3, pp. 47-63, 2008. 

[25] J. Yuan, J. Li, and B. Zhang, "Scene understanding with discrim- 
inative structured prediction," in CVPR, 2008. 

[26] J. J. Verbeek and B. Triggs, "Scene segmentation with CRFs learned
from partially labeled images," in NIPS, 2007.

[27] D. Scharstein and C. Pal, "Learning conditional random fields for 
stereo," in CVPR, 2007. 

[28] P. Zhong and R. Wang, "Using combination of statistical models 
and multilevel structural information for detecting urban areas 
from a single gray-level image," IEEE T. Geoscience and Remote 
Sensing, vol. 45, no. 5-2, pp. 1469-1482, 2007. 



[29] J. Besag, "Statistical analysis of non-lattice data," Journal of the 
Royal Statistical Society. Series D (The Statistician), vol. 24, no. 3, 
pp. 179-195, 1975. 

[30] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale 
conditional random fields for image labeling," in CVPR, 2004. 

[31] S. Kumar and M. Hebert, "Discriminative random fields," Interna- 
tional Journal of Computer Vision, vol. 68, no. 2, pp. 179-201, 2006. 

[32] S. C. Zhu and X. Liu, "Learning in gibbsian fields: How accurate 
and how fast can it be?" IEEE Transactions on Pattern Analysis and 
Machine Intelligence, vol. 24, pp. 1001-1006, 2002. 

[33] C. Sutton and A. McCallum, "Piecewise training for undirected 
models," in UAI, 2005. 

[34] S. Kim and I.-S. Kweon, "Robust model-based scene interpreta- 
tion by multilayered context information," Computer Vision and 
Image Understanding, vol. 105, no. 3, pp. 167-187, 2007. 

[35] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi, "Textonboost 
for image understanding: Multi-class object recognition and seg- 
mentation by jointly modeling texture, layout, and context," Int. 
J. of Comput. Vision, vol. 81, no. 1, pp. 2-23, 2009. 

[36] V. Stoyanov, A. Ropson, and J. Eisner, "Empirical risk minimiza- 
tion of graphical model parameters given approximate inference, 
decoding, and model structure," in AISTATS, 2011. 

[37] J. Domke, "Learning convex inference of marginals," in UAI, 2008. 

[38] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "A new
algorithm for the estimation of hidden Markov model parameters,"
in ICASSP, 1988.

[39] S. Kakade, Y. W. Teh, and S. Roweis, "An alternate objective
function for Markovian fields," in ICML, 2002.

[40] B. G. Lindsay, "Composite likelihood methods," Contemporary
Mathematics, vol. 80, pp. 221-239, 1988.

[41] J. Domke, "Learning convex inference of marginals," in UAI, 2008.

[42] C. Desai, D. Ramanan, and C. C. Fowlkes, "Discriminative models
for multi-class object layout," International Journal of Computer
Vision, vol. 95, no. 1, pp. 1-12, 2011.

[43] M. Szummer, P. Kohli, and D. Hoiem, "Learning CRFs using
graph cuts," in ECCV, 2008.

[44] J. J. McAuley, T. E. de Campos, G. Csurka, and F. Perronnin,
"Hierarchical image-region labeling via structured learning," in
BMVC, 2009.

[45] W. Yang, B. Triggs, D. Dai, and G.-S. Xia, "Scene segmentation 
via low-dimensional semantic representation and conditional ran- 
dom field," HAL, Tech. Rep., 2009. 

[46] J. Domke, "Implicit differentiation by perturbation," in NIPS,
2010.

[47] A. Boresi and K. Chong, Approximate Solution Methods in Engineer- 
ing Mechanics. Elsevier Science Inc., 1991. 

[48] N. Andrei, "Accelerated conjugate gradient algorithm with finite
difference Hessian/vector product approximation for unconstrained
optimization," J. Comput. Appl. Math., vol. 230, no. 2, pp.
570-582, 2009.

[49] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. 
Springer, 2006. 

[50] M. Welling and Y. W. Teh, "Linear response algorithms for
approximate inference in graphical models," Neural Computation,
vol. 16, pp. 197-221, 2004.

[51] J. Domke, "Parameter learning with truncated message-passing," 
in CVPR, 2011. 

[52] V. Stoyanov and J. Eisner, "Minimum-risk training of approximate 
CRF-based NLP systems," in Proceedings of NAACL-HLT. 

[53] F. Eaton and Z. Ghahramani, "Choosing a variable to clamp," in 
AISTATS, 2009. 

[54] G. Konidaris, S. Osentoski, and P. Thomas, "Value function
approximation in reinforcement learning using the Fourier basis," in
AAAI, 2011.

[55] N. Dalal and B. Triggs, "Histograms of oriented gradients for
human detection," in CVPR, 2005.



10 Biography 

Justin Domke obtained a PhD degree in Computer Science
from the University of Maryland, College Park in
2009. From 2009 to 2012, he was an Assistant Professor
at Rochester Institute of Technology. Since 2012, he has been a
member of the Machine Learning group at NICTA.






11 Appendix A: Variational Inference 

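Theorem (Exact variational principle). The log-partition function can also be represented as

$$A(\theta) = \max_{\mu \in \mathcal{M}} \; \theta \cdot \mu + H(\mu),$$

where

$$\mathcal{M} = \{\mu' : \exists \theta, \ \mu' = \mu(\theta)\}$$

is the marginal polytope, and

$$H(\mu) = -\sum_{\mathbf{x}} p(\mathbf{x}; \theta(\mu)) \log p(\mathbf{x}; \theta(\mu))$$

is the entropy.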

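Proof of the exact variational principle: As A is convex, we have that

$$A(\theta) = \sup_{\mu} \; \theta \cdot \mu - A^*(\mu),$$

where

$$A^*(\mu) = \sup_{\theta} \; \theta \cdot \mu - A(\theta)$$

is the conjugate dual.

Now, since $dA/d\theta = \mu(\theta)$, if $\mu \notin \mathcal{M}$, then the supremum for $A^*$ is unbounded above. For $\mu \in \mathcal{M}$, the supremum will be obtained at $\theta(\mu)$. Thus

$$A^*(\mu) = \begin{cases} \infty & \mu \notin \mathcal{M} \\ \theta(\mu) \cdot \mu - A(\theta(\mu)) & \mu \in \mathcal{M}. \end{cases}$$

Now, for $\mu \in \mathcal{M}$, we can see by substitution that

$$\begin{aligned}
A^*(\mu) &= \theta(\mu) \cdot \sum_{\mathbf{x}} p(\mathbf{x}; \theta(\mu)) f(\mathbf{x}) - A(\theta(\mu)) \\
&= \sum_{\mathbf{x}} p(\mathbf{x}; \theta(\mu)) \big( \theta(\mu) \cdot f(\mathbf{x}) - A(\theta(\mu)) \big) \\
&= \sum_{\mathbf{x}} p(\mathbf{x}; \theta(\mu)) \log p(\mathbf{x}; \theta(\mu)) = -H(\mu).
\end{aligned}$$

And so, finally,

$$A^*(\mu) = \begin{cases} \infty & \mu \notin \mathcal{M} \\ -H(\mu) & \mu \in \mathcal{M}, \end{cases} \tag{25}$$

which is equivalent to the desired result. □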
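
As a sanity check on the variational principle, the identity can be verified by brute-force enumeration on a tiny model. The sketch below is only an illustration: the 3-variable binary model, the clique set, and all function names are assumptions made here, not part of the paper. It checks that the objective equals A(θ) at μ(θ), and that an arbitrary distribution q over joint states gives a value no larger than A(θ).

```python
import itertools
import numpy as np

def log_partition_and_marginals(theta, F):
    """Brute-force A(theta), p(x; theta) and mu(theta) = E[f(x)] over all joint states."""
    scores = F @ theta                                   # theta . f(x) for every state x
    A = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    p = np.exp(scores - A)
    return A, p, F.T @ p

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical toy model: 3 binary variables with cliques (0, 1) and (1, 2).
states = list(itertools.product([0, 1], repeat=3))
cliques = [(0, 1), (1, 2)]

def f(x):
    """Indicator features: one per (variable, value) and per (clique, configuration)."""
    uni = [float(x[i] == v) for i in range(3) for v in (0, 1)]
    pair = [float((x[a], x[b]) == (u, v))
            for a, b in cliques for u in (0, 1) for v in (0, 1)]
    return uni + pair

F = np.array([f(x) for x in states])

rng = np.random.default_rng(0)
theta = rng.normal(size=F.shape[1])
A, p, mu = log_partition_and_marginals(theta, F)

# At mu = mu(theta), the variational objective equals the log-partition function.
gap_at_optimum = A - (theta @ mu + entropy(p))

# Any other distribution q over states yields a value no larger than A(theta).
q = rng.dirichlet(np.ones(len(states)))
value_q = theta @ (F.T @ q) + entropy(q)
```

The second check works because the value for q equals A(θ) minus the KL divergence from q to p(x; θ), which is never negative.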

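Theorem (Mean Field Updates). A local maximum of the mean-field objective can be reached by iterating the updates

$$\mu(x_j) \leftarrow \frac{1}{Z} \exp\Big( \theta(x_j) + \sum_{c : j \in c} \sum_{\mathbf{x}_{c \setminus j}} \theta(\mathbf{x}_c) \prod_{i \in c \setminus j} \mu(x_i) \Big),$$

where Z is a normalizing factor ensuring that $\sum_{x_j} \mu(x_j) = 1$.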

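Proof of Mean Field Updates: The first thing to note is that for $\mu \in \mathcal{F}$, several simplifying results hold, which are easy to verify, namely

$$\begin{aligned}
p(\mathbf{x}; \theta(\mu)) &= \prod_i \mu(x_i) \\
H(\mu) &= -\sum_i \sum_{x_i} \mu(x_i) \log \mu(x_i) \\
\mu(\mathbf{x}_c) &= \prod_{i \in c} \mu(x_i).
\end{aligned}$$

Now, let $\tilde{A}$ denote the approximate partition function that results from solving Eq. 11 with the marginal polytope replaced by $\mathcal{F}$. By substitution of previous results, we can see that this reduces to an optimization over univariate marginals only.

$$\tilde{A}(\theta) = \max_{\{\mu(x_i)\}} \; \sum_i \sum_{x_i} \theta(x_i) \mu(x_i) + \sum_c \sum_{\mathbf{x}_c} \theta(\mathbf{x}_c) \prod_{i \in c} \mu(x_i) - \sum_i \sum_{x_i} \mu(x_i) \log \mu(x_i). \tag{26}$$

Now, form a Lagrangian, enforcing that $\sum_{x_j} \mu(x_j) = 1$.

$$L = \sum_j \sum_{x_j} \theta(x_j) \mu(x_j) + \sum_c \sum_{\mathbf{x}_c} \theta(\mathbf{x}_c) \prod_{i \in c} \mu(x_i) - \sum_j \sum_{x_j} \mu(x_j) \log \mu(x_j) + \sum_j \lambda_j \Big( 1 - \sum_{x_j} \mu(x_j) \Big).$$

Setting $dL/d\mu(x_j) = 0$ and solving for $\mu(x_j)$, we obtain

$$\mu(x_j) \propto \exp\Big( \theta(x_j) + \sum_{c : j \in c} \sum_{\mathbf{x}_{c \setminus j}} \theta(\mathbf{x}_c) \prod_{i \in c \setminus j} \mu(x_i) \Big),$$

meaning this is a stationary point with respect to $\mu(x_j)$. Normalizing by Z gives the result.

Note that only a local maximum results since the mean-field objective is non-concave [8, Sec. 5.4]. □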
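
The update in this theorem can be sketched in a few lines for the pairwise case. This is a minimal illustration under assumptions made here (pairwise cliques only, a hypothetical 4-cycle with random parameters, the names `mean_field` and `objective`), not the paper's implementation. Each update exactly maximizes the mean-field objective over one univariate marginal, so the objective should not decrease from sweep to sweep.

```python
import numpy as np

def mean_field(theta_i, theta_c, n_sweeps=50):
    """Coordinate-wise mean-field updates for a pairwise model.

    theta_i: (n, k) array of univariate parameters theta(x_i).
    theta_c: dict mapping an edge (a, b) to a (k, k) array theta(x_a, x_b).
    """
    n, k = theta_i.shape
    mu = np.full((n, k), 1.0 / k)
    for _ in range(n_sweeps):
        for j in range(n):
            s = theta_i[j].copy()
            for (a, b), th in theta_c.items():
                if a == j:
                    s += th @ mu[b]          # sum over x_b of theta(x_j, x_b) mu(x_b)
                elif b == j:
                    s += th.T @ mu[a]        # sum over x_a of theta(x_a, x_j) mu(x_a)
            s -= s.max()                     # numerical stability only
            mu[j] = np.exp(s) / np.exp(s).sum()   # division by the normalizer Z
    return mu

def objective(mu, theta_i, theta_c):
    """Mean-field objective: theta . mu plus the entropy of the product distribution."""
    val = np.sum(theta_i * mu) - np.sum(mu * np.log(mu))
    for (a, b), th in theta_c.items():
        val += mu[a] @ th @ mu[b]
    return val

# Hypothetical 4-cycle with 3 states per variable.
rng = np.random.default_rng(0)
theta_i = rng.normal(size=(4, 3))
theta_c = {e: rng.normal(size=(3, 3)) for e in [(0, 1), (1, 2), (2, 3), (0, 3)]}

mu_1 = mean_field(theta_i, theta_c, n_sweeps=1)
mu_50 = mean_field(theta_i, theta_c, n_sweeps=50)
```

Comparing `objective(mu_1, ...)` with `objective(mu_50, ...)` shows the monotone improvement. Which local maximum is reached depends on the initialization and on the update order.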

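Two preliminary results are needed to prove the TRW entropy bound.

Lemma. Let $\mu_{\mathcal{G}}$ be the "projection" of $\mu$ onto a subgraph $\mathcal{G}$, defined by

$$\mu_{\mathcal{G}} = \{\mu(x_i) \ \forall i\} \cup \{\mu(\mathbf{x}_c) \ \forall c \in \mathcal{G}\}.$$

Then, for $\mu \in \mathcal{M}$,

$$H(\mu_{\mathcal{G}}) \geq H(\mu).$$

Proof: First, note that, by Eq. 25, for $\mu \in \mathcal{M}$,

$$H(\mu) = -A^*(\mu) = -\sup_{\theta} \big( \theta \cdot \mu - A(\theta) \big).$$

Now, the entropy of $\mu_{\mathcal{G}}$ could be defined as a supremum only over the parameters $\theta$ corresponding to the cliques in $\mathcal{G}$. Equivalently, however, we can define it as a constrained optimization

$$H(\mu_{\mathcal{G}}) = -\sup_{\theta : \theta_c = 0 \ \forall c \notin \mathcal{G}} \big( \theta \cdot \mu_{\mathcal{G}} - A(\theta) \big).$$

Since the supremum for $H(\mu_{\mathcal{G}})$ takes place over a more constrained set, but $\mu$ and $\mu_{\mathcal{G}}$ are identical on all the components where $\theta$ may be nonzero, we have the result. □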
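Our next result is that the approximate entropy considered in Eq. 16 is exact for tree-structured graphs, when $\rho_c = 1$.

Lemma. For $\mu \in \mathcal{M}$, for a marginal polytope $\mathcal{M}$ corresponding to a tree-structured graph,

$$H(\mu) = \sum_i H(\mu_i) - \sum_c I(\mu_c).$$

Proof: First, note that any tree-structured distribution can be factored as

$$p(\mathbf{x}) = \prod_i p(x_i) \prod_c \frac{p(\mathbf{x}_c)}{\prod_{i \in c} p(x_i)}.$$

(This is easily shown by induction.) Now, recall our definition of H:

$$H(\mu) = -\sum_{\mathbf{x}} p(\mathbf{x}; \theta(\mu)) \log p(\mathbf{x}; \theta(\mu)).$$

Substituting the tree-factorized version of p into the equation yields

$$\begin{aligned}
H(\mu) &= -\sum_{\mathbf{x}} p(\mathbf{x}; \theta(\mu)) \log p(\mathbf{x}; \theta(\mu)) \\
&= -\sum_{\mathbf{x}} p(\mathbf{x}; \theta(\mu)) \sum_i \log \mu(x_i) - \sum_{\mathbf{x}} p(\mathbf{x}; \theta(\mu)) \sum_c \log \frac{\mu(\mathbf{x}_c)}{\prod_{i \in c} \mu(x_i)} \\
&= \sum_i H(\mu_i) - \sum_c I(\mu_c).
\end{aligned}$$

□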
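
The tree-entropy identity is easy to confirm numerically. The following sketch assumes a hypothetical 3-variable chain with random pairwise potentials (all names and sizes are illustrative choices, not taken from the paper) and compares the brute-force entropy with the sum of univariate entropies minus the edge mutual informations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
k, edges = 3, [(0, 1), (1, 2)]               # hypothetical chain, k states per variable
th = {e: rng.normal(size=(k, k)) for e in edges}

# Joint distribution by enumeration; itertools.product order matches a C-order reshape.
states = list(itertools.product(range(k), repeat=3))
logp = np.array([sum(th[(a, b)][x[a], x[b]] for a, b in edges) for x in states])
p = np.exp(logp - logp.max())
p /= p.sum()
P = p.reshape(k, k, k)

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log(q))

uni = [P.sum(axis=tuple(j for j in range(3) if j != i)) for i in range(3)]

def pair(a, b):
    return P.sum(axis=tuple(j for j in range(3) if j not in (a, b)))

def I(a, b):
    """Mutual information of the pairwise marginal on edge (a, b)."""
    return H(uni[a]) + H(uni[b]) - H(pair(a, b).ravel())

tree_entropy = sum(H(u) for u in uni) - sum(I(a, b) for a, b in edges)
exact_entropy = H(p)
```

On a tree the two quantities agree up to floating-point error. On a graph with cycles they generally differ.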

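Finally, combining these two lemmas, we can show the main result, that the TRW entropy is an upper bound.

Theorem (TRW Entropy Bound). Let $\Pr(\mathcal{G})$ be a distribution over tree-structured graphs, and define $\rho_c = \Pr(c \in \mathcal{G})$. Then, with $\tilde{H}$ as defined in Eq. 16,

$$\tilde{H}(\mu) \geq H(\mu).$$

Proof: The first Lemma above shows that for any specific tree $\mathcal{G}$,

$$H(\mu) \leq H(\mu_{\mathcal{G}}).$$

Thus, it follows that

$$\begin{aligned}
H(\mu) &\leq \sum_{\mathcal{G}} \Pr(\mathcal{G}) H(\mu_{\mathcal{G}}) \\
&= \sum_{\mathcal{G}} \Pr(\mathcal{G}) \Big( \sum_i H(\mu_i) - \sum_{c \in \mathcal{G}} I(\mu_c) \Big) \\
&= \sum_i H(\mu_i) - \sum_c \rho_c I(\mu_c),
\end{aligned}$$

where the first equality uses the tree-entropy Lemma. □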
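
To see the bound on a graph with a cycle, the sketch below evaluates both entropies on a hypothetical 3-cycle of binary variables. It assumes a uniform distribution over the three spanning trees, so every edge has appearance probability 2/3. The model and all names are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
k, edges = 2, [(0, 1), (1, 2), (0, 2)]       # hypothetical 3-cycle of binary variables
rho = 2.0 / 3.0                              # each edge lies in 2 of the 3 spanning trees

states = list(itertools.product(range(k), repeat=3))
th = {e: rng.normal(size=(k, k)) for e in edges}
logp = np.array([sum(th[(a, b)][x[a], x[b]] for a, b in edges) for x in states])
p = np.exp(logp - logp.max())
p /= p.sum()
P = p.reshape(k, k, k)

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log(q))

uni = [P.sum(axis=tuple(j for j in range(3) if j != i)) for i in range(3)]

def I(a, b):
    """Mutual information of the pairwise marginal on edge (a, b)."""
    pab = P.sum(axis=tuple(j for j in range(3) if j not in (a, b)))
    return H(uni[a]) + H(uni[b]) - H(pab.ravel())

trw_entropy = sum(H(u) for u in uni) - rho * sum(I(a, b) for a, b in edges)
exact_entropy = H(p)
```

Here `trw_entropy` should be at least `exact_entropy`, since the marginals come from a true joint distribution and therefore lie in the marginal polytope.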



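Theorem (TRW Updates). Let $\rho_c$ be as in the previous theorem. Then, if the TRW message updates reach a fixed point, the marginals defined by

$$\begin{aligned}
\mu(\mathbf{x}_c) &\propto e^{\frac{1}{\rho_c} \theta(\mathbf{x}_c)} \prod_{i \in c} e^{\theta(x_i)} \frac{\prod_{d : i \in d} m_d(x_i)^{\rho_d}}{m_c(x_i)} \\
\mu(x_i) &\propto e^{\theta(x_i)} \prod_{d : i \in d} m_d(x_i)^{\rho_d}
\end{aligned}$$

constitute the global optimum of Eq. 13.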

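Proof: The TRW optimization is defined by

$$\tilde{A}(\theta) = \max_{\mu \in \mathcal{L}} \; \theta \cdot \mu + \tilde{H}(\mu).$$

Consider the equivalent optimization

$$\begin{aligned}
\max_{\mu} \quad & \theta \cdot \mu + \tilde{H}(\mu) \\
\text{s.t.} \quad & 1 = \sum_{x_i} \mu(x_i) \\
& \mu(x_i) = \sum_{\mathbf{x}_{c \setminus i}} \mu(\mathbf{x}_c),
\end{aligned}$$

which makes the constraints of the local polytope explicit.

First, we form a Lagrangian, and consider derivatives with respect to $\mu$, for fixed Lagrange multipliers.

$$\begin{aligned}
L &= \theta \cdot \mu + \tilde{H}(\mu) + \sum_i \lambda_i \Big( 1 - \sum_{x_i} \mu(x_i) \Big) + \sum_c \sum_{i \in c} \sum_{x_i} \lambda_c(x_i) \Big( \sum_{\mathbf{x}_{c \setminus i}} \mu(\mathbf{x}_c) - \mu(x_i) \Big) \\
\frac{dL}{d\mu(\mathbf{x}_c)} &= \theta(\mathbf{x}_c) + \rho_c \Big( \sum_{i \in c} \big( 1 + \log \sum_{\mathbf{x}'_{-i}} \mu(x_i, \mathbf{x}'_{-i}) \big) - 1 - \log \mu(\mathbf{x}_c) \Big) + \sum_{i \in c} \lambda_c(x_i) \\
\frac{dL}{d\mu(x_i)} &= \theta(x_i) - 1 - \log \mu(x_i) - \lambda_i - \sum_{c : i \in c} \lambda_c(x_i)
\end{aligned}$$

Setting these derivatives equal to zero, we can solve for the log-marginals in terms of the Lagrange multipliers:

$$\begin{aligned}
\rho_c \log \mu(\mathbf{x}_c) &= \theta(\mathbf{x}_c) + \rho_c \Big( \sum_{i \in c} \big( 1 + \log \sum_{\mathbf{x}'_{-i}} \mu(x_i, \mathbf{x}'_{-i}) \big) - 1 \Big) + \sum_{i \in c} \lambda_c(x_i) \\
\log \mu(x_i) &= \theta(x_i) - 1 - \lambda_i - \sum_{c : i \in c} \lambda_c(x_i)
\end{aligned}$$

Now, at a solution, we must have that $\mu(x_i) = \sum_{\mathbf{x}_{c \setminus i}} \mu(\mathbf{x}_c)$. This leads first to the constraint that

$$\begin{aligned}
\log \mu(\mathbf{x}_c) &= \frac{1}{\rho_c} \theta(\mathbf{x}_c) + \sum_{i \in c} \Big( 1 + \log \mu(x_i) + \frac{1}{\rho_c} \lambda_c(x_i) \Big) - 1 \\
&= \frac{1}{\rho_c} \theta(\mathbf{x}_c) + \sum_{i \in c} \Big( 1 + \theta(x_i) - 1 - \lambda_i - \sum_{d : i \in d} \lambda_d(x_i) + \frac{1}{\rho_c} \lambda_c(x_i) \Big) - 1.
\end{aligned}$$

Now, define the "messages" in terms of the Lagrange multipliers as

$$m_c(x_i) = e^{-\frac{1}{\rho_c} \lambda_c(x_i)}.$$

If the appropriate values of the messages were known, then we could solve for the clique-wise marginals as

$$\begin{aligned}
\mu(\mathbf{x}_c) &\propto e^{\frac{1}{\rho_c} \theta(\mathbf{x}_c)} \prod_{i \in c} e^{\theta(x_i)} \exp\Big( \frac{1}{\rho_c} \lambda_c(x_i) \Big) \prod_{d : i \in d} \exp(-\lambda_d(x_i)) \\
&= e^{\frac{1}{\rho_c} \theta(\mathbf{x}_c)} \prod_{i \in c} e^{\theta(x_i)} \frac{\prod_{d : i \in d} m_d(x_i)^{\rho_d}}{m_c(x_i)}.
\end{aligned}$$

The univariate marginals are available simply as

$$\begin{aligned}
\mu(x_i) &\propto \exp\Big( \theta(x_i) - \sum_{d : i \in d} \lambda_d(x_i) \Big) \\
&= e^{\theta(x_i)} \prod_{d : i \in d} m_d(x_i)^{\rho_d}.
\end{aligned}$$

We may now derive the actual propagation. At a valid solution, the Lagrange multipliers (and hence the messages) must be selected so that the constraints are satisfied. In particular, we must have that $\mu(x_i) = \sum_{\mathbf{x}_{c \setminus i}} \mu(\mathbf{x}_c)$. From the constraint, we can derive constraints on neighboring sets of messages.

$$e^{\theta(x_i)} \prod_{d : i \in d} m_d(x_i)^{\rho_d} \propto \sum_{\mathbf{x}_{c \setminus i}} e^{\frac{1}{\rho_c} \theta(\mathbf{x}_c)} \prod_{j \in c} e^{\theta(x_j)} \frac{\prod_{d : j \in d} m_d(x_j)^{\rho_d}}{m_c(x_j)} \tag{27}$$

Now, the left hand side of this equation cancels one term from the product on line 27, except for the denominator of $m_c(x_i)$. This leads to the constraint of

$$m_c(x_i) \propto \sum_{\mathbf{x}_{c \setminus i}} e^{\frac{1}{\rho_c} \theta(\mathbf{x}_c)} \prod_{j \in c \setminus i} e^{\theta(x_j)} \frac{\prod_{d : j \in d} m_d(x_j)^{\rho_d}}{m_c(x_j)}.$$

This is exactly the equation used as a fixed-point equation in the TRW algorithm. □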
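
The fixed-point equation can be iterated directly. Below is a minimal sketch for pairwise cliques only; the function name `trw`, the data layout, and the toy chain are assumptions made for illustration, and no damping or convergence test is included. With all $\rho_c = 1$ on a tree, the update reduces to ordinary belief propagation, which gives a simple correctness check against brute-force marginals.

```python
import numpy as np

def trw(theta_i, theta_c, rho, n_sweeps=20):
    """Pairwise TRW message passing (a sketch; assumes every clique is an edge).

    theta_i: (n, k) univariate parameters; theta_c: {(a, b): (k, k) array};
    rho: {(a, b): edge appearance probability}.
    """
    n, k = theta_i.shape
    # m[(c, i)] is the message from clique c to variable i, initialised to 1.
    m = {(c, i): np.ones(k) for c in theta_c for i in c}

    def belief(j):
        # e^{theta(x_j)} * prod over cliques d containing j of m_d(x_j)^{rho_d}
        b = np.exp(theta_i[j])
        for d in theta_c:
            if j in d:
                b = b * m[(d, j)] ** rho[d]
        return b

    for _ in range(n_sweeps):
        for c in theta_c:
            for i in c:
                j = c[1] if i == c[0] else c[0]
                th = theta_c[c] if i == c[0] else theta_c[c].T   # indexed [x_i, x_j]
                incoming = belief(j) / m[(c, j)]
                new = np.exp(th / rho[c]) @ incoming             # sum over x_j
                m[(c, i)] = new / new.sum()
    mu = np.array([belief(i) for i in range(n)])
    return mu / mu.sum(axis=1, keepdims=True)

# Hypothetical 3-variable chain; rho = 1 on a tree makes the updates exact.
rng = np.random.default_rng(0)
theta_i = rng.normal(size=(3, 2))
theta_c = {(0, 1): rng.normal(size=(2, 2)), (1, 2): rng.normal(size=(2, 2))}
rho = {c: 1.0 for c in theta_c}
mu = trw(theta_i, theta_c, rho)
```

On graphs with cycles one would pass appearance probabilities below 1. Convergence of the plain iteration is then not guaranteed by this sketch.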



12 Appendix B: Implicit Differentiation 

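Theorem. Suppose that

$$\mu(\theta) := \arg\max_{\mu : B\mu = d} \; \theta \cdot \mu + H(\mu).$$

Define $L(\theta, \mathbf{x}) = Q(\mu(\theta), \mathbf{x})$. Then, letting $D = \frac{\partial^2 H}{\partial \mu \partial \mu^T}$,

$$\frac{dL}{d\theta} = \big( D^{-1} B^T (B D^{-1} B^T)^{-1} B D^{-1} - D^{-1} \big) \frac{dQ}{d\mu}.$$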

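Proof: First, recall the implicit differentiation theorem. If the relationship between a and b is implicitly determined by $f(a, b) = 0$, then

$$\frac{db^T}{da} = -\frac{\partial f^T}{\partial a} \Big( \frac{\partial f^T}{\partial b} \Big)^{-1}.$$

In our case, given the Lagrangian

$$L = \theta \cdot \mu + H(\mu) + \lambda^T (B\mu - d),$$

our implicit function is determined by the constraints that $\frac{dL}{d\mu} = 0$ and $\frac{dL}{d\lambda} = 0$. That is, it must be true that

$$\begin{aligned}
\frac{dL}{d\mu} &= \theta + \frac{\partial H}{\partial \mu} + B^T \lambda = 0 \\
\frac{dL}{d\lambda} &= B\mu - d = 0.
\end{aligned}$$

Thus, our implicit function is

$$f\Big( \begin{bmatrix} \mu \\ \lambda \end{bmatrix} \Big) = \begin{bmatrix} \theta + \frac{\partial H}{\partial \mu} + B^T \lambda \\ B\mu - d \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

Taking derivatives, we have that

$$\frac{d}{d\theta} \begin{bmatrix} \mu \\ \lambda \end{bmatrix}^T = -\frac{\partial f^T}{\partial \theta} \Big( \frac{\partial f^T}{\partial \begin{bmatrix} \mu \\ \lambda \end{bmatrix}} \Big)^{-1}.$$

Taking the terms on the right hand side in turn, we have

$$\frac{\partial f^T}{\partial \theta} = \begin{bmatrix} I & 0 \end{bmatrix}, \qquad \frac{\partial f^T}{\partial \begin{bmatrix} \mu \\ \lambda \end{bmatrix}} = \begin{bmatrix} \frac{\partial^2 H}{\partial \mu \partial \mu^T} & B^T \\ B & 0 \end{bmatrix},$$

$$\frac{d}{d\theta} \begin{bmatrix} \mu \\ \lambda \end{bmatrix}^T = -\begin{bmatrix} I & 0 \end{bmatrix} \begin{bmatrix} \frac{\partial^2 H}{\partial \mu \partial \mu^T} & B^T \\ B & 0 \end{bmatrix}^{-1}. \tag{28}$$

This means that $-\frac{d\mu^T}{d\theta}$ is the upper-left block of the inverse of the matrix on the right hand side of Eq. 28. It is well known that if

$$M = \begin{bmatrix} E & F \\ G & H \end{bmatrix},$$

then the upper-left block of $M^{-1}$ is

$$E^{-1} + E^{-1} F (H - G E^{-1} F)^{-1} G E^{-1}.$$

So, we have that

$$\frac{d\mu^T}{d\theta} = D^{-1} B^T (B D^{-1} B^T)^{-1} B D^{-1} - D^{-1}, \tag{29}$$

where $D := \frac{\partial^2 H}{\partial \mu \partial \mu^T}$.

The result follows simply from substituting Eq. 29 into the chain rule

$$\frac{dL}{d\theta} = \frac{d\mu^T}{d\theta} \frac{dQ}{d\mu}.$$

□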
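
A small special case makes the formula concrete. Assume, purely for illustration, that the only constraint is normalization (B is a row of ones and d = 1) and that H is the Shannon entropy of a single discrete distribution. Then μ(θ) is the softmax of θ, and the expression for $d\mu^T/d\theta$ should reduce to the familiar softmax Jacobian. The sketch below checks this against finite differences; the parameter values are arbitrary.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.array([0.3, -1.2, 0.8, 0.1])
mu = softmax(theta)                 # argmax of theta . mu + H(mu) subject to sum(mu) = 1

B = np.ones((1, theta.size))        # constraint matrix in B mu = d, with d = 1
D = np.diag(-1.0 / mu)              # Hessian of H(mu) = -sum_i mu_i log mu_i
Dinv = np.linalg.inv(D)

# d mu^T / d theta from the theorem
dmu_dtheta = Dinv @ B.T @ np.linalg.inv(B @ Dinv @ B.T) @ B @ Dinv - Dinv

# Central finite differences: row i holds d mu / d theta_i
eps = 1e-6
fd = np.array([(softmax(theta + eps * e) - softmax(theta - eps * e)) / (2 * eps)
               for e in np.eye(theta.size)])
```

In this case the formula evaluates to diag(μ) minus the outer product of μ with itself, which is the known Jacobian of the softmax map.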



13 Appendix C: Truncated Fitting 

Several simple lemmas will be useful below. A first one considers the case where we have a "product of powers".

Lemma (Products of Powers). Suppose that $b = \prod_i a_i^{p_i}$. Then

$$\frac{db}{da_i} = \frac{p_i}{a_i} b.$$
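
The products-of-powers derivative can be checked against central finite differences. The values of `a` and `p` below are arbitrary illustrative choices.

```python
import numpy as np

a = np.array([1.5, 0.7, 2.0])       # arbitrary positive bases
p = np.array([2.0, -1.0, 0.5])      # arbitrary exponents

def b_of(a):
    """b = prod_i a_i^{p_i}"""
    return np.prod(a ** p)

analytic = p * b_of(a) / a          # db/da_i = (p_i / a_i) * b

eps = 1e-6
numeric = np.array([(b_of(a + eps * e) - b_of(a - eps * e)) / (2 * eps)
                    for e in np.eye(a.size)])
```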

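Next, both mean-field and TRW involve steps where we first take a product of a set of terms, and then normalize. The following lemma is useful in dealing with such operations.

Lemma (Normalized Products). Suppose that $b_i = \prod_j a_{ij}$ and $c_i = b_i / \sum_j b_j$. Then,

$$\frac{dc_i}{da_{jk}} = (I_{i=j} - c_i) \frac{c_j}{a_{jk}}.$$

Corollary. Under the same conditions,

$$\frac{dL}{da_{jk}} = \frac{c_j}{a_{jk}} \Big( \frac{dL}{dc_j} - \sum_i \frac{dL}{dc_i} c_i \Big).$$

Accordingly, we find it useful to define the operator

$$\text{backnorm}(g, c) = c \odot (g - g \cdot c).$$

This can be used as follows. Suppose that we have calculated $\overleftarrow{c} = \frac{dL}{dc}$. Then, if we set $\overleftarrow{\nu} = \text{backnorm}(\overleftarrow{c}, c)$, we have that

$$\frac{dL}{da_{jk}} = \frac{\overleftarrow{\nu}_j}{a_{jk}}.$$

An important special case of this is where $a_{jk} = \exp f_{jk}$. Then, we have simply that $\frac{dL}{df_{jk}} = \overleftarrow{\nu}_j$.

Another important special case is where $a_{jk} = f_{jk}^{\rho}$. Then, we have that $\frac{da_{jk}}{df_{jk}} = \rho f_{jk}^{\rho - 1}$, and so $\frac{dL}{df_{jk}} = \rho \frac{\overleftarrow{\nu}_j}{f_{jk}}$.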
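
The backnorm operator and the exponential special case can be sketched and checked numerically. In the sketch below, the linear loss with weights `w` and the array shapes are assumptions chosen only to give a concrete gradient to compare against.

```python
import numpy as np

def backnorm(g, c):
    """backnorm(g, c) = c * (g - g . c), with an elementwise product by c."""
    return c * (g - g @ c)

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 3))         # f_jk, with a_jk = exp(f_jk)
w = rng.normal(size=4)              # hypothetical linear loss L = w . c

def loss(f):
    b = np.exp(f).prod(axis=1)      # b_j = prod_k a_jk
    c = b / b.sum()                 # normalized products
    return w @ c

b = np.exp(f).prod(axis=1)
c = b / b.sum()
nu = backnorm(w, c)                 # here dL/dc = w
grad = np.repeat(nu[:, None], f.shape[1], axis=1)   # dL/df_jk = nu_j for every k

# Central finite differences for comparison
eps = 1e-6
numeric = np.zeros_like(f)
for j in range(f.shape[0]):
    for k in range(f.shape[1]):
        d = np.zeros_like(f)
        d[j, k] = eps
        numeric[j, k] = (loss(f + d) - loss(f - d)) / (2 * eps)
```

Because each row of `f` enters only through its sum, every entry in row j receives the same gradient, which is why `grad` simply repeats `nu` across columns.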
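Theorem. After execution of back mean field,

$$\overleftarrow{\theta}(x_i) = \frac{dL}{d\theta(x_i)} \quad \text{and} \quad \overleftarrow{\theta}(\mathbf{x}_c) = \frac{dL}{d\theta(\mathbf{x}_c)}.$$

Proof sketch: The idea is just to mechanically differentiate each step of the algorithm, computing the derivative of the computed loss with respect to each intermediate quantity. First, note that we can re-write the main mean-field iteration as

$$\mu(x_j) \propto \exp(\theta(x_j)) \prod_{c : j \in c} \prod_{\mathbf{x}_{c \setminus j}} \exp\Big( \theta(\mathbf{x}_c) \prod_{i \in c \setminus j} \mu(x_i) \Big). \tag{30}$$

Now, suppose we have the derivative of the loss with respect to this intermediate vector of marginals, $\overleftarrow{\mu}_j$. We wish to "push back" this derivative on the values affecting these marginals, namely $\theta(x_j)$, $\theta(\mathbf{x}_c)$ (for all c such that $j \in c$), and $\mu(x_i)$ (for all $i \neq j$ such that $\exists c : \{i, j\} \subseteq c$). To do this, we take two steps:

1) Calculate the derivative with respect to the value on the right-hand side of Eq. 30 before normalization.

2) Calculate the derivative of this un-normalized quantity with respect to $\theta(x_j)$, $\theta(\mathbf{x}_c)$ and $\mu(x_i)$.

Now, define $\nu_j$ to be the vector of marginals produced by Eq. 30 before normalization. Then, by the discussion above, $\overleftarrow{\nu}_j = \text{backnorm}(\overleftarrow{\mu}_j, \mu_j)$. This completes step 1.

Now, with $\overleftarrow{\nu}_j$ in hand, we can immediately calculate the backpropagation of $\overleftarrow{\nu}_j$ on $\theta$ as

$$\overleftarrow{\theta}(x_j) \leftarrow \overleftarrow{\theta}(x_j) + \overleftarrow{\nu}(x_j).$$

This follows from the fact that $\frac{d}{da} e^{a} = e^{a}$ (the special case $a_{jk} = \exp f_{jk}$ above), where $\theta(x_j)$ plays the role of a, and $e^{a}$ is a factor of $\nu(x_j)$. Similarly, we can calculate that

$$\frac{d \log \nu(x_j)}{d\theta(\mathbf{x}_c)} = \prod_{i \in c \setminus j} \mu(x_i).$$

Thus, we have that

$$\overleftarrow{\theta}(\mathbf{x}_c) \leftarrow \overleftarrow{\theta}(\mathbf{x}_c) + \overleftarrow{\nu}(x_j) \prod_{i \in c \setminus j} \mu(x_i).$$

Similarly, for any $\mathbf{x}_c$ that "matches" $x_i$ (in the sense that the same value $x_i$ is present as the appropriate component of $\mathbf{x}_c$),

$$\frac{d}{d\mu(x_i)} \Big( \theta(\mathbf{x}_c) \prod_{k \in c \setminus j} \mu(x_k) \Big) = \theta(\mathbf{x}_c) \prod_{k \in c \setminus \{i, j\}} \mu(x_k).$$

From which we have

$$\overleftarrow{\mu}(x_i) \leftarrow \overleftarrow{\mu}(x_i) + \sum_{\mathbf{x}_{c \setminus i}} \overleftarrow{\nu}(x_j) \, \theta(\mathbf{x}_c) \prod_{k \in c \setminus \{i, j\}} \mu(x_k).$$

□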
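
The two-step procedure in this proof sketch can be written out for a pairwise model and checked against finite differences. What follows is a minimal sketch under assumptions made here, not the paper's back mean field algorithm: pairwise cliques only, a fixed update order, a uniform initialization that does not depend on the parameters, a hypothetical linear loss with weights `w`, and illustrative function names.

```python
import numpy as np

def backnorm(g, c):
    return c * (g - g @ c)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def forward(theta_i, theta_c, n_sweeps):
    """Truncated mean field on a pairwise model; records each overwritten marginal."""
    n, k = theta_i.shape
    mu = np.full((n, k), 1.0 / k)
    tape = []
    for _ in range(n_sweeps):
        for j in range(n):
            s = theta_i[j].copy()
            for (a, b), th in theta_c.items():
                if a == j:
                    s += th @ mu[b]
                elif b == j:
                    s += th.T @ mu[a]
            tape.append((j, mu[j].copy()))
            mu[j] = softmax(s)
    return mu, tape

def backward(theta_c, mu, tape, dmu):
    """Push dL/dmu back through the recorded updates, in reverse order."""
    mu, dmu = mu.copy(), dmu.copy()
    d_i = np.zeros_like(mu)
    d_c = {c: np.zeros_like(th) for c, th in theta_c.items()}
    for j, old in reversed(tape):
        nu = backnorm(dmu[j], mu[j])     # step 1: derivative before normalization
        dmu[j] = 0.0                     # the overwritten value starts with zero gradient
        d_i[j] += nu                     # step 2: push back onto theta(x_j)
        for (a, b), th in theta_c.items():
            if a == j:
                d_c[(a, b)] += np.outer(nu, mu[b])   # onto theta(x_c)
                dmu[b] += th.T @ nu                  # onto the neighbour's marginal
            elif b == j:
                d_c[(a, b)] += np.outer(mu[a], nu)
                dmu[a] += th @ nu
        mu[j] = old                      # restore the marginal that was overwritten
    return d_i, d_c

# Hypothetical 3-variable chain with a linear loss L = sum(w * mu).
rng = np.random.default_rng(0)
theta_i = rng.normal(size=(3, 2))
theta_c = {(0, 1): rng.normal(size=(2, 2)), (1, 2): rng.normal(size=(2, 2))}
w = rng.normal(size=(3, 2))

mu, tape = forward(theta_i, theta_c, n_sweeps=3)
d_i, d_c = backward(theta_c, mu, tape, dmu=w)
```

Restoring each overwritten marginal while walking the tape backwards means every step sees the same neighbouring marginals as in the forward pass, so only one copy of the current marginals plus the tape of overwritten values needs to be kept.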

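Theorem. After execution of back TRW,

$$\overleftarrow{\theta}(x_i) = \frac{dL}{d\theta(x_i)} \quad \text{and} \quad \overleftarrow{\theta}(\mathbf{x}_c) = \frac{dL}{d\theta(\mathbf{x}_c)}.$$

Proof sketch: Again, the idea is just to mechanically differentiate each step of the algorithm. Since the marginals are derived in terms of the messages, we must first take derivatives with respect to the marginal-producing steps. First, consider step 3, where predicted clique marginals are computed. Defining $\overleftarrow{\nu}(\mathbf{x}_c) = \text{backnorm}(\overleftarrow{\mu}_c, \mu_c)$, we have that

$$\begin{aligned}
\overleftarrow{\theta}(\mathbf{x}_c) &\leftarrow \overleftarrow{\theta}(\mathbf{x}_c) + \frac{1}{\rho_c} \overleftarrow{\nu}(\mathbf{x}_c) \\
\overleftarrow{\theta}(x_i) &\leftarrow \overleftarrow{\theta}(x_i) + \sum_{\mathbf{x}_{c \setminus i}} \overleftarrow{\nu}(\mathbf{x}_c) \\
\overleftarrow{m}_d(x_i) &\leftarrow \overleftarrow{m}_d(x_i) + \frac{\rho_d - I_{c=d}}{m_d(x_i)} \sum_{\mathbf{x}_{c \setminus i}} \overleftarrow{\nu}(\mathbf{x}_c).
\end{aligned}$$

Next, consider step 4, where predicted univariate marginals are computed. Defining $\overleftarrow{\nu}(x_i) = \text{backnorm}(\overleftarrow{\mu}_i, \mu_i)$, we have

$$\begin{aligned}
\overleftarrow{\theta}(x_i) &\leftarrow \overleftarrow{\theta}(x_i) + \overleftarrow{\nu}(x_i) \\
\overleftarrow{m}_d(x_i) &\leftarrow \overleftarrow{m}_d(x_i) + \rho_d \frac{\overleftarrow{\nu}(x_i)}{m_d(x_i)}.
\end{aligned}$$

Finally, we consider the main propagation, in step 2. Here, we recompute the intermediate quantity

$$s(\mathbf{x}_c) = e^{\frac{1}{\rho_c} \theta(\mathbf{x}_c)} \prod_{j \in c \setminus i} e^{\theta(x_j)} \frac{\prod_{d : j \in d} m_d(x_j)^{\rho_d}}{m_c(x_j)}.$$

After this, consider the step where pair (c, i) is being updated. We first compute

$$\frac{dL}{d\tilde{m}_c(x_i)} = \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)},$$

where $\tilde{m}_c(x_i)$ is defined as the value of the message before normalization, and

$$\overleftarrow{\nu}(x_i) = \text{backnorm}(\overleftarrow{m}_{ci}, m_{ci}).$$

(See the Normalized Products Lemma above.) Given this, we can consider the updates required to gradients of $\theta(\mathbf{x}_c)$, $\theta(x_j)$ and $m_d(x_j)$ in turn.

First, we have that the update to $\overleftarrow{\theta}(\mathbf{x}_c)$ should be

$$\frac{dL}{d\tilde{m}_c(x_i)} \frac{d\tilde{m}_c(x_i)}{d\theta(\mathbf{x}_c)} = \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)} \frac{1}{\rho_c} s(\mathbf{x}_c),$$

which is the update present in the algorithm.

Next, the update to $\overleftarrow{\theta}(x_j)$ should be

$$\sum_{\mathbf{x}_{c \setminus j}} \frac{dL}{d\tilde{m}_c(x_i)} \frac{d\tilde{m}_c(x_i)}{d\theta(x_j)} = \sum_{\mathbf{x}_{c \setminus j}} \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)} s(\mathbf{x}_c).$$

In terms of the incoming messages, consider the update to $\overleftarrow{m}_d(x_j)$, where $j \neq i$, $j \in d$, and $d \neq c$. This will be

$$\begin{aligned}
\sum_{\mathbf{x}_{c \setminus j}} \frac{dL}{d\tilde{m}_c(x_i)} \frac{d\tilde{m}_c(x_i)}{dm_d(x_j)} &= \sum_{\mathbf{x}_{c \setminus j}} \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)} \frac{\rho_d}{m_d(x_j)} s(\mathbf{x}_c) \\
&= \frac{\rho_d}{m_d(x_j)} \sum_{\mathbf{x}_{c \setminus j}} \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)} s(\mathbf{x}_c).
\end{aligned}$$

Finally, consider the update to $\overleftarrow{m}_c(x_j)$, where $j \neq i$. This will have the previous update, plus the additional term, considering the presence of $m_c(x_j)$ in the denominator of the main TRW update, of

$$-\frac{1}{m_c(x_j)} \sum_{\mathbf{x}_{c \setminus j}} \frac{\overleftarrow{\nu}(x_i)}{m_c(x_i)} s(\mathbf{x}_c).$$

Now, after the update has taken place, the messages $m_c(x_i)$ are reverted to their previous values. As these values have not (yet) influenced any other variables, they are initialized with $\overleftarrow{m}_c(x_i) = 0$. □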
Figure 8: Predicted marginals for an example binary denoising test image with different noise levels n.

Figure 9: Predicted marginals for an example binary denoising test image with different noise levels n.

Figure 10: Predicted marginals for an example binary denoising test image with different noise levels n.

Figure 11: Predicted marginals for a test image from the horses dataset. Truncated learning uses 40 iterations.

Figure 12: Predicted marginals for a test image from the horses dataset. Truncated learning uses 40 iterations.

Panel titles (Figures 11 and 12): (b) True Labels, (c) Surr. Like. TRW, (d) U. Logistic TRW, (e) Sm. Class A=50 TRW, (f) Surr. Like. MNF, (g) U. Logistic MNF, (h) Independent.

Figure 13: Example marginals from the backgrounds dataset using 20 iterations for truncated fitting.

Figure 14: Example marginals from the backgrounds dataset using 20 iterations for truncated fitting.

Figure 15: Example marginals from the backgrounds dataset using 20 iterations for truncated fitting.

Panel titles (Figures 13 to 15): (a) Input Image, (b) True Labels, (c) Surrogate EM, (d) Univ. Logistic, (e) Clique Logistic, (f) Pseudolikelihood, (g) Piecewise, (h) Independent.


